The Effect of Hammer Size on Efficiency in the Task of Nailing *


Stewart J. Briggs, E. J. McCormick, and N. C. Kephart 
Occupational Research Center, Purdue University 


Any hardware store salesman “knows” 
what size and type of hammer to use with 
different sizes and types of nails. On the ba- 
sis of the intuitive knowledge that impreg- 
nates the atmosphere of any hardware store, 
the salesman will sell to the home craftsman 
a small hammer to use with small nails and 
a large hammer to use with larger nails. 
There has apparently never been any empirical
evidence, however, to verify or deny
the salesman’s judgment on these matters. 
This study was designed to provide at least 
a fragment of such empirical evidence. 

More specifically the investigation was car- 
ried out to determine the relationships, in 
terms of efficiency in nailing, between sizes 
and types of hammers and sizes and types of 
nails as used by home craftsmen. Six ham- 
mers were used in the experiment, four of 
them being claw hammers and two rip ham- 
mers. Five sizes of finishing nails and five 
sizes of common nails were used. 


Experimental Procedures 


While it would have been desirable to es- 
tablish conditions that simulated those which 
the home craftsman would meet, it was not 
possible to accomplish this objective entirely 
because of the need to exercise experimental 
controls. 


Pilot Study. A pilot study was carried out 
with one subject. On the basis of the pilot 
study, certain observations were made and these 
were used in developing the procedures for the 
experiment proper. Following are the observa- 
tions that resulted from the pilot study: 


* Appreciation is expressed to Mr. L. A. O’Connor, 
Store Manager, and Mr. Myron Burkenpas, Man- 
ager of the Hardware Department, Sears Roebuck 
and Company, Lafayette, Indiana, for the loan of 
the hammers for this experiment. 


1. The measured time of the task was a more 
suitable criterion of performance than number of 
strikes of hammer since it takes into account the 
effect of bent nails. 

2. The wood used should be of uniform grain 
and of medium hardness. 

3. The optimum number of nails to be driven 
for each nail-hammer combination was about 
three. 

4. Rest periods were necessary to reduce vari- 
ance due to fatigue. 

Subjects. Six subjects were used in the ex- 
periment. All of the subjects selected had had 
experience as home craftsmen, yet not as profes- 
sional carpenters. The subjects were all males 
between the ages of 21 and 39 years, and were 
associated with Purdue University; one was a 
professor of psychology, four were graduate stu- 
dents in psychology, and one was an undergradu- 
ate in the field of engineering. 

Materials. The following materials were used: 

1. Six hammers were used: four were classi- 
fied commercially as 7, 10, 13, and 16 oz. claw 
hammers; and two were 16 and 20 oz. rip ham- 
mers. The hammers were marked with letters 
for identification. It should be noted that weight 
size refers to the weight of the hammer head, 
and the terms “claw” and “rip” refer to the shape 
of the head. 

2. Nails of the following types were used: 4, 
6, 8, 10, and 16 penny common wire nails, and 
2, 4, 6, 8, and 10 penny wire finishing nails. The 
nails varied in length by half inch intervals and 
increased in gauge with the larger sizes. The 
finishing nails were of smaller gauge than their 
penny equivalents in common nails. 

3. One eight foot top grade fir 2 X 4 per sub- 
ject. 

4. Two sawhorses approximately 34 inches in 
height equipped with a wooden groove to hold 
the 2 X 4 in place during the experiment. 

5. Nail containers: one wooden nail bin of 
nine compartments and one can for holding the 
largest nails. Each container was marked with 
the size of the nail it contained. 

6. One table on which the nail bins were placed,
positioned to place the nails within easy reach of
the subject.

7. One stop watch calibrated in hundredths of
a minute.

Warm-up Period. Each subject was allowed a 
short warm-up period during which he drove up 
to a total of ten nails of various sizes using three
or four different hammers. This warm-up pe- 
riod ended when the subject said he was ready 
to begin the experiment. 

Experimental Sequence. Each subject drove 
nails of each of the ten types and sizes with 
each of the six hammers, making a total of sixty 
combinations of nails and hammers for each sub- 
ject. A deck of sixty IBM cards was prepared 
for each subject, each card representing a com- 
bination of one nail type and size and one ham- 
mer. These cards were thoroughly shuffled, and 
the subject followed the randomized order that 
resulted from this shuffling; as the subject com- 
pleted any one combination of nail and hammer, 
the experimenter would tell him what combina- 
tion to use next. Three nails of each type and 
size were driven by the subject. The experi- 
menter timed the subject on the total time re- 
quired to drive the three nails from the time the 
subject grasped the first nail until he had completed
the third one. The time records were recorded on the
cards. (The time records were later punched into these
cards for use in statistical analysis.)
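
The shuffled deck of sixty cards amounts to a random permutation of every hammer-nail pairing. A minimal sketch in Python, assuming nothing beyond the six hammers and ten nails listed under Materials; the label strings, the function name `shuffled_deck`, and the `seed` parameter are illustrative and not part of the original procedure:

```python
import random
from itertools import product

# Labels are illustrative; the sizes come from the Materials list.
HAMMERS = ["7 oz claw", "10 oz claw", "13 oz claw", "16 oz claw",
           "16 oz rip", "20 oz rip"]
NAILS = ([f"{p} penny common" for p in (4, 6, 8, 10, 16)]
         + [f"{p} penny finishing" for p in (2, 4, 6, 8, 10)])


def shuffled_deck(seed=None):
    """Return all 60 hammer-nail combinations in random order.

    `seed` is a hypothetical convenience for reproducibility; the
    experimenters shuffled physical cards.
    """
    rng = random.Random(seed)
    deck = list(product(HAMMERS, NAILS))  # one "card" per combination
    rng.shuffle(deck)
    return deck


if __name__ == "__main__":
    # Show the first five combinations one subject would be given.
    for hammer, nail in shuffled_deck(seed=1)[:5]:
        print(hammer, "/", nail)
```

Calling `shuffled_deck` once per subject gives each subject an independent order, which matches the description of a separate deck for each subject.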

Short rest pauses of approximately one-half 
minute were introduced between each of the 
sixty combinations. Two longer rest periods of 
ten minutes divided the experimental time into
roughly three equal intervals.

Instructions to the Subject. The following in- 
structions were given to the subject: 

“The task involves driving nails into this 2 x 4. 
You are to drive the nails in sets of three; that 
is, I will measure the time from the instant you 
grasp the first nail until you have finished driv- 
ing the third. Drive each nail until its head is 
flush with the board before driving the next. 
Try not to mar the wood. If the nail starts to 
bend, try to correct it; and if it seems too bent, 
pull it out and use another nail in its place. 

“Before each set I will tell you which hammer 
and nail to use. These are identified by the let- 
ters on the hammer and the numbers on the com- 
partments of the nail bins. You will then select 
the proper hammer and hold it in the hand you 
wish to hammer with. When you have located
the proper nail bin, say ‘ready,’ after which I will
say ‘go.’ Then grasp the nail and start hammering.
Drive the nails as fast as possible, remembering
that bent nails will slow you up. Are
there any questions?” 


Results 


The data were treated statistically using an
analysis of variance to identify significant
variables and interactions. The data were
further treated using the process described
by Tukey¹ to break up the data into significantly
different groups.


Table 1

Analysis of Variance

Source                              Mean Square
Subjects                                155.1
Hammers                                 734.3
Nails                                  2344.4
Hammers X Subjects                       43.4
Nails X Subjects                         46.7
Hammers X Nails                          73.0
Hammers X Nails X Subjects               40.2
Total

Special Analyses

Source                              Mean Square    df    F
Hammers
  16 oz. claw - 16 oz. rip                6.07
Nails (4, 6, 8, 10 penny)
  Finishing - Common                    417.9       1    8.94**

** Denotes significance at the 1% level of confidence.


Analysis of Variance. The results of the 
analysis of variance are presented in Table 1. 
These findings may be interpreted as showing 
that the variance within the different sizes of 
hammers as well as different sizes of nails is 
statistically significant. Only one claw ham- 
mer and one rip hammer were of comparable 
size (16 oz.) and they were found not to be 
significantly different. There was a signifi- 
cant difference between the two types of nails 
(common and finishing) when only those 
sizes represented in both types (4, 6, 8, 10 
penny) were considered. The finishing nails 
were driven more slowly than were their 
penny equivalents in common nails. 

There was no significant interaction either 
between hammers and subjects or between 
nails and subjects. There was found a sig- 
nificant variance ratio in the hammer by nail 
interaction. That is, certain hammer and 
nail combinations can be considered better 
than others when driving time is used as a 
criterion. To locate these specific combina- 
tions, the Tukey process was used. 
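
The variance ratios behind these statements can be approximated from the mean squares in Table 1. The sketch below is an illustration under stated assumptions, not the authors' computation: it assumes that subjects were treated as a random factor, so that each main effect is tested against its interaction with subjects and each two-way interaction against the three-way term. The dictionary names and the function names are hypothetical, and the degrees of freedom are derived from the design (6 subjects, 6 hammers, 10 nails, one timed set per cell) rather than read from the table.

```python
# Mean squares as read from Table 1.
MEAN_SQUARES = {
    "hammers": 734.3,
    "nails": 2344.4,
    "hammers_x_subjects": 43.4,
    "nails_x_subjects": 46.7,
    "hammers_x_nails": 73.0,
    "hammers_x_nails_x_subjects": 40.2,
    "finishing_vs_common": 417.9,
}

# ASSUMPTION: error term for each effect, with subjects as a random factor.
ERROR_TERM = {
    "hammers": "hammers_x_subjects",
    "nails": "nails_x_subjects",
    "finishing_vs_common": "nails_x_subjects",
    "hammers_x_nails": "hammers_x_nails_x_subjects",
    "hammers_x_subjects": "hammers_x_nails_x_subjects",
    "nails_x_subjects": "hammers_x_nails_x_subjects",
}


def degrees_of_freedom(subjects=6, hammers=6, nails=10):
    """Degrees of freedom implied by a fully crossed design."""
    s, h, n = subjects - 1, hammers - 1, nails - 1
    return {
        "subjects": s,
        "hammers": h,
        "nails": n,
        "hammers_x_subjects": h * s,
        "nails_x_subjects": n * s,
        "hammers_x_nails": h * n,
        "hammers_x_nails_x_subjects": h * n * s,
    }


def f_ratio(effect):
    """Variance ratio of an effect against its assumed error term."""
    return MEAN_SQUARES[effect] / MEAN_SQUARES[ERROR_TERM[effect]]


if __name__ == "__main__":
    for effect in ERROR_TERM:
        print(f"{effect:22s} F = {f_ratio(effect):6.2f}")
```

Under these assumptions the ratio for finishing versus common nails comes to about 8.95, close to the 8.94 reported in Table 1, which lends some support to the assumed error term. The ratios for the two interactions with subjects are near 1, in line with the statement that those interactions were not significant.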


1 Tukey, J. W. Comparing individual means in
the analysis of variance. Biometrics, 1949, 5, 99-114.
Table 2

Hammer Size Groups for Specific Nails

Nail    Hammer
        16 Claw, 20 Rip, 13 Claw, 16 Rip, 7 Claw, 10 Claw
6C      16 Claw, 20 Rip, 16 Rip, 13 Claw, 10 Claw, 7 Claw
8C      13 Claw, 16 Claw, 20 Rip, 16 Rip, 10 Claw, 7 Claw
10C     16 Claw, 20 Rip, 16 Rip, 13 Claw, 10 Claw, 7 Claw
16C     20 Rip, 16 Rip, 13 Claw, 16 Claw, 10 Claw, 7 Claw
        16 Rip, 13 Claw, 10 Claw, 7 Claw, 16 Claw, 20 Rip
        16 Rip, 20 Rip, 16 Claw, 13 Claw, 10 Claw, 7 Claw
        20 Rip, 16 Claw, 16 Rip, 13 Claw, 7 Claw, 10 Claw
        16 Rip, 20 Rip, 16 Claw, 13 Claw, 7 Claw, 10 Claw
        16 Rip, 13 Claw, 16 Claw, 20 Rip, 10 Claw, 7 Claw

Legend:
Nail Type:
  Number refers to penny size (2 = 2 penny, 4 = 4 penny, etc.).
  C = Common wire nail; F = Finishing wire nail.
Hammers:
  Number refers to size (16 = 16 oz. hammer, etc.).
Subgroups:
  * = No significantly different subgroups formed (at 1% level).
  I = Subgroup with significantly faster driving time (at 1% level).
  II = Subgroup with significantly slower driving time (at 1% level).


Tukey Process.² The data presented in
Tables 2 and 3 represent the results of em- 
ploying the technique developed by Tukey 
for dividing a group of means into signifi- 
cantly different subgroups. Table 2 shows 
for each nail size the subgroups that oc- 
curred between the means of the different 


2 Tukey, J. W. Op. cit. 


hammers, i.e., given a nail of a certain type 
and size, which hammer or hammers are the 
best? Table 3 shows the subgroups occurring 
between nails when each hammer was used. 
As the number of subgroups formed is de- 
pendent upon the variance of the whole 
group, the number of subgroups is not con- 
stant. 
In Tables 2 and 3 the significantly different
subgroups are identified with Roman
numerals. With two subgroups numerals I 
and II are used; with three subgroups, nu- 
merals I, II, and III. An asterisk (*) is used 
where no subgroups were found. 
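
The idea of dividing a set of means into subgroups can be illustrated with a much simplified stand-in: sort the means and start a new subgroup wherever the gap between neighbours exceeds a threshold. This is not Tukey's full process, whose criterion depends on the variance found in the analysis and which applies further tests within groups. The threshold of 4.0 used here is hypothetical, chosen only for illustration, and the function name `split_by_gaps` is invented. The ten values are the mean times as they appear for the 13 oz. claw hammer in Table 3.

```python
# Mean times as they appear for the 13 oz. claw hammer in Table 3.
CLAW_13_MEANS = [23.50, 26.00, 27.50, 28.67, 30.33,
                 30.67, 34.83, 37.00, 37.33, 47.00]


def split_by_gaps(means, min_gap):
    """Split sorted means into subgroups at gaps larger than `min_gap`.

    `min_gap` is a HYPOTHETICAL fixed threshold; in Tukey's process the
    criterion is derived from the error variance, not chosen by hand.
    """
    ordered = sorted(means)
    groups = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > min_gap:
            groups.append([current])    # gap is large: open a new subgroup
        else:
            groups[-1].append(current)  # gap is small: stay in the subgroup
    return groups


if __name__ == "__main__":
    labels = ["I", "II", "III", "IV", "V"]
    for label, group in zip(labels, split_by_gaps(CLAW_13_MEANS, min_gap=4.0)):
        print(label, group)
```

With this illustrative threshold the ten means fall into three subgroups, which shows why the number of subgroups is not constant from one hammer or nail to the next: it depends on where sufficiently large gaps happen to occur.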


On the basis of the subgroups that were 
found, it is possible to make certain general 
recommendations with regard to hammer-nail 
combinations. In Table 2 it will be noted 
that for three nail sizes significantly different 
subgroups of hammers were formed; in each 


Table 3

Significantly Different Hammer Subgroups for Specific Nails

Hammer: 7 Claw
Mean Time: 25.17, 26.33, 30.00, 33.00, 33.50, 39.33, 40.17, 51.83, 52.50, 66.67

Hammer: 10 Claw
Mean Time: 25.33, 26.00, 29.17, 31.83, 35.50, 36.33, 40.83, 43.67, 45.17, 54.33

Hammer: 13 Claw
Mean Time: 23.50, 26.00, 27.50, 28.67, 30.33, 30.67, 34.83, 37.00, 37.33, 47.00

Hammer: 16 Claw
Nail: 4C, 6C, 4F, 6F, 2F, 8C, 8F, 10C, 10F, 16C

Hammer: 16 Rip
Nail: 2F, 4C, 4F, 6C, ..., 8F, 8C, ..., ..., ...

Hammer: 20 Rip

Mean Time (16 Claw, 16 Rip, or 20 Rip): 36.67, 46.17; 23.17, 26.83, 27.00, 28.33, 31.33, 33.33, 34.00, 34.17, 40.17, 41.50

Legend:
Nail Type:
  Number refers to penny size (2 = 2 penny, 4 = 4 penny, etc.).
  C = Common wire nail; F = Finishing wire nail.
Hammer:
  Number refers to size (16 = 16 oz. hammer, etc.).
Subgroups:
  I = Subgroup with significantly faster driving time (at 1% level).
  II = Subgroup significantly slower than I (at 1% level).
  III = Subgroup significantly slower than I and II (at 1% level).
such case the hammers in subgroup I are to
be recommended over those in subgroup II. 

From an overview of Table 2 it may be 
concluded that the 10 and 7 oz. hammers 
would not be good general purpose hammers 
under conditions similar to those in this ex- 
periment. The 16 oz. and 20 oz. rip ham- 
mers, and the 13 and 16 oz. claw hammers 
appear to have better all-around character- 
istics. 

It will be observed in Table 3 (which deals 
with subgroups of nails for individual ham- 
mers) that significant subgroups of nails were 
formed for each hammer. In some cases two 
subgroups were formed; in such cases the 
nails included in subgroup I were driven sig- 
nificantly faster with the hammer in question 
than those nails in subgroup II. In the case 
of other hammers, three subgroups of nails 
were formed; in such cases the nails included 
in subgroup II were driven faster than those 
in subgroup III, and those in subgroup I
were driven faster than those in II. A general
observation of Table 3 would suggest
that smaller nails were driven faster than 
larger nails, which of course is to be ex- 
pected. It might be noted that the 4 penny 
common nails were in subgroup I for all ham- 
mers. 

It should be stressed that the data in Table 
2 are applicable for the situation where nails 
of a given size and type are to be driven, and 
it is desired to select the most efficient avail- 
able hammer for the job; this covers most 
home craftsman situations. However, it is 
conceivable that situations would arise where 
various nails could be used equally well, 
where they are to be used in quantities, and 
possibly where there is a limited choice of 
hammers; in such a situation the data in 
Table 3 would be appropriate since they show 
the relative speeds with which various nails 
were driven with specified hammers. 


Discussion 


The results are not entirely consistent with 
the salesman’s intuitive judgment. The large 
hammers were found to be better with the 
larger nails; however, for smaller nails, the 
smaller hammers were not significantly bet- 
ter. This may be a function of the range of 


nail sizes; if small brads had been included, 
the smaller hammers might have been found 
to be better, although it should be noted that 
two penny finishing nails (which were used 
in the experiment) are quite small, being only 
one inch long and of small gauge. 

The two 16 oz. hammers were expected to
have the same hammering characteristics as 
they differ only slightly in the shape of the 
nail pulling part of the hammer head. This 
small deviation would not be expected to af- 
fect the balance of the hammer seriously, and 
no significant differences were found between 
these two types of hammers. 

The statistically significant difference be- 
tween common and finishing nails was not 
entirely expected. It was thought at first 
that the greater diameter of the common 
nails would offer greater resistance and hence 
slow up the driving. However, this same 
greater diameter presumably tended to re- 
duce the time lost due to nail bending. It is 
also possible that the appearance to the subjects
of greater frailty of the finishing nails
may have made them somewhat more cautious
(and therefore slower) in driving the finish- 
ing nails. 

It should be kept in mind that the experi- 
ment was conducted using only fir. While 
this is a commonly used wood by the home 
craftsman, the results cannot be generalized 
with assurance to harder or softer woods. It 
might be hypothesized that the results are 
more general than this experiment indicates, 
as the relationship of the weight of the ham- 
mer to the bending resistance of the nail 
might be more crucial than the hardness of 
the wood. If this were true, it would indicate 
that the skill of the hammerer is most impor- 
tant in nailing into harder woods; but the 
relationship of the hammer and nail would be 
the same. Further research would, of course, 
be required to explore such variables. 

It is recognized that time is not necessarily 
the best criterion of performance for every 
situation; for instance, in cabinet work or 
finish carpentry, lack of mars in the wood 
undoubtedly would be a better criterion of 
performance than speed. It should be noted 
that in this experiment an attempt was made 
through instructions and reminders to control
the marring of the wood, but this was
not completely successful. 

The entire field of study of the tools of the 
home craftsman is lacking in systematic in- 
vestigation. The methods of analysis of 
variance and Tukey’s process appear to be 
powerful tools for study in this area because 
they allow more than one variable to be 
studied at a time and yet permit specific 
recommendations to be made. In studies of 
this field, it seems advisable to plan a pilot
experiment (as the one carried out in this 
study) to locate and control some of the un- 
expected experimental difficulties so they will 
not interfere with the main study. 


Summary 


This study was carried out to determine the 
relationship in terms of efficiency of use be- 
tween six hammers (7, 10, 13, and 16 oz. 
claw and 16 and 20 oz. rip hammers) and 
ten nails (4, 6, 8, 10, and 16 penny common
and 2, 4, 6, 8, and 10 penny finishing nails) 
when used by home craftsmen. The six sub- 
jects were home craftsmen without profes- 
sional carpentering experience. The subjects 
drove a set of three nails into a fir 2 x 4 for 


each of the sixty possible hammer and nail 
combinations. Time was the criterion of 
performance. 

Analysis of variance was used on the data, 
and the results indicated: 


1. The variance in time due to the different
hammers was statistically significant. 

2. The variance in time due to the different 
nails was statistically significant. 

3. There was no statistically determined 
difference between the 16 oz. rip and claw
hammers. 

4. The finishing nails were slower to drive 
than the common nails. 

5. The variance in time due to nail by 
hammer interaction was significant. 

The data were further treated by Tukey’s 
process to locate various significant sub- 
groups of hammer-nail combinations. Spe- 
cific recommendations were made considering 
first the hammer, then the nail, as the inde- 
pendent variable. 

The methods used were felt to be appli- 
cable to other research in the field of home 
craftsman’s tools. 


Received April 23, 1953. 
A Note on “Predicting Success in Elementary Accounting”


Robert Jacobs 
Educational Records Bureau, New York, N. Y. 


A study reported by O. R. Hendrix in the 
April, 1953 issue of J. appl. Psychol.¹ compared
the validities of the 1947 Edition of
the ACE Psychological Examination, the 
Ohio State University Psychological Test 
(Form 23), and the latest form (Form C)
of the Orientation Test used in the account- 
ing testing program sponsored by the Ameri- 
can Institute of Accountants. The criterion 
for validity was grades received in elementary 
accounting by 76 men and 19 women in the 
College of Commerce and Industry of the 
University of Wyoming. 

The correlations reported in this study between
accounting grades and the scores on
the different tests ran somewhat higher for 
the ACE Psychological Examination and the 
OSU Psychological Test than for the AIA 
Orientation Test. 

On the basis of the data which he obtained, 
Hendrix concluded that “If a single test is to 
be utilized in predicting grades in elementary
accounting, ACE Psychological Examination 
and OSU Psychological Test are preferable to 
the AIA Orientation Test.” The author of 
the study points out that the investigation 
was restricted to the relationship between the 
test scores considered and grades and that a 
different pattern of relationship might be 
found if the criterion of validity were success 
in actual professional employment as an ac- 
countant. 

However, the relative superiority of the 
AIA Orientation Test when compared with 
tests of general scholastic ability in predict- 
ing success in accounting study is a matter of 
concern to counselors and to teachers when 
the Orientation Test is used in the College 
Accounting Testing Program. 

A considerable amount of research data has 
been accumulated at the project office relat- 
ing to the reliability and validity of the tests 
used in the College Accounting Testing Pro- 


1 O. R. Hendrix. Predicting success in elementary
accounting. J. appl. Psychol., 1953, 37, 75-77. 


gram. Some of these data are the result of 
research carried out at the project office; 
some are results of independent studies car- 
ried out by participating schools and reported 
to the project office. As with any program 
which reaches into many institutions and
many different kinds of situations, the data 
show a rather wide range of results. In some 
schools, correlations between Orientation Test 
scores and accounting grades have been un- 
usually high, while with other groups in dif- 
ferent schools relationships shown have been 
on the disappointing side. The usual pro- 
cedure in dealing with such an accumulation 
of data is to generalize on the basis of cen- 
tral tendencies of results. This procedure 
has been followed in reporting on the validity 
and the reliability of the instruments used in 
the College Accounting Testing Program. The 
point is that it is usually an unsafe procedure 
to generalize from a single study based on a 
particular group of students. If the results 
obtained with one group are borne out with 
data from similar research based on different 
groups, it may be safe to generalize a finding 
or a trend. 

Most of the comparative validity data 
gathered at the project office has been con- 
cerned with a comparison of the Orientation 
Test and the ACE Psychological Examina- 
tion. This is true because the ACE test is 
the most widely used test of scholastic ability 
at the college level, and hence, most of the 
questions concerning superiority of the Ori- 
entation Test coming from institutions par- 
ticipating in the program related to the ACE 
test which, commonly, was part of the battery 
of tests already used in the college. The data 
from several of these studies are shown in 
Table 1, together with the results reported by 
Hendrix. 

The Table 1 correlations show varying re- 
sults, but only in the Hendrix study does the 
difference between the pair of correlations 
favor the ACE test. The data reported for 
Table 1

Correlations between Orientation Test Total Score and Grades in Accounting Courses Compared with
Correlations between ACE Psychological Examination Total Score and Accounting Grades *

                                                    Orient. vs. Grades    ACE vs. Grades
Source of Data          Institution                    N       r            N       r
Project Office Study    Drake University              363                  294
Project Office Study    U. of Louisville Group A      166                  161
Project Office Study    U. of Louisville Group B      133                  134
Project Office Study    Wayne University              265                   99
Roth Study              CCNY                          148                  148
Hendrix Study           University of Wyoming          95                   95

* In most instances the ACE Psychological Examination is administered at the beginning of the freshman
year and the Orientation Test at the beginning of the sophomore year, but before the students have completed
a semester of accounting study.



the project office studies show differing N’s
for the comparative correlations. This may
raise some questions regarding the validity of
comparisons. The correlations between other
test scores and accounting grades were ob- 
tained as supplementary studies following the 
checks on relationships between Orientation 
Test scores and grades. Scores on other tests 
were not available for all students taking the 
Orientation Test, with the exception of the 
University of Louisville Group B, but so far 
as is known, no bias occurred in the use of 
the smaller population. The difference in N 
is of most concern in the Drake University 
and Wayne University data, and it will be 
noted that the superiority of the Orientation 
Test is less noticeable in these two instances 
than in the case of the two University of 
Louisville groups where the N’s are in closer 
agreement. Furthermore, the study in which 
the N’s were the same, the one carried on 
by Roth at CCNY (unpublished), shows as 
much difference in favor of the Orientation 
Test as does Hendrix’s study in favor of the 
ACE. 

However, the point of this short note is not 
so much to argue the superiority of the Ori- 


entation Test as to suggest the danger in gen- 
eralizing the superiority of one testing in- 
strument over another on the basis of a study 
using the results from a single institution. 

The Hendrix study reports the only com- 
parison between the Orientation Test and the 
OSU Psychological Test which has come to 
the attention of the project office (OSU test 
vs. grades = .37; AIA test vs. grades = .32). 
As indicated with the ACE exam data, how- 
ever, it would be hazardous to generalize on 
the basis of this one bit of evidence. 
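
The hazard can be made concrete with an approximate confidence interval for a correlation based on 95 cases. The sketch below uses the Fisher z transformation, a standard approximation that is not part of either study; the function name `fisher_ci` and the 95 per cent level are choices made here for illustration. It also treats each coefficient on its own, although both come from the same students, so it is only a rough guide to sampling error and not a formal test of the difference.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation coefficient.

    Uses the Fisher z transformation: atanh(r) is treated as roughly
    normal with standard error 1 / sqrt(n - 3). `z_crit` = 1.96 gives
    an interval of about 95 per cent.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    n = 95  # cases in the Hendrix study
    for label, r in (("OSU test vs. grades", 0.37),
                     ("AIA test vs. grades", 0.32)):
        low, high = fisher_ci(r, n)
        print(f"{label}: r = {r:.2f}, interval {low:.2f} to {high:.2f}")
```

With 95 cases the intervals run from roughly .18 to .53 for the coefficient of .37 and from roughly .13 to .49 for the coefficient of .32, so each interval comfortably contains the other coefficient.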

It is believed by this writer that a further 
note of caution could be added to Hendrix’s 
summary to the effect that “It does not nec- 
essarily follow that the same relationship 
would be obtained in a different institution 
and with a different group of students.” The 
data shown in Table 1 indicate that results 
do differ from one group to another, and they 
suggest further that the general trend in com- 
parative validity tends to favor the Orienta- 
tion Test rather than the ACE Psychological 
Examination. 


Received September 24, 1953. 
Published out-of-turn by the editor. 
In the preceding article, Jacobs has suggested
that a further note of caution could
be added to the summary of the study re- 
ported by this writer in the April, 1953 issue 
of the Journal of Applied Psychology. The 
suggested note is: “It does not necessarily 
follow that the same relationship would be 
obtained in a different institution and with a 
different group of students.” 

One could certainly have no objection to 
such a statement. As Jacobs points out, nu- 
merous correlation studies have verified its 
accuracy. This writer most certainly did not 
intend to imply that the results of his limited 
study had general application. In fact, he 
tried to guard against such an assumption by 
stating in his opening paragraph that the 
study reported represented an investigation 
of the relative validity of the several tests 
“for predicting success in elementary accounting
at the University of Wyoming” (italics
added). 

While concurring with Jacobs’ desire to 
guard against generalization on the basis of 
a single study, one should be equally careful 
to keep a number of limitations of Jacobs’ 
own study in mind while considering his 
statement concerning “the relative superiority 
of the AIA Orientation Test when compared 
with tests of general scholastic ability in pre- 
dicting success in accounting study. . . .” 

First, there is the limitation growing out 
of the differences in the number of cases 
used in the computation of the coefficients of 
correlation between grades and Orientation 
Test scores and the number of cases used in 
computing the coefficients of correlation be- 
tween grades and the ACE. In the instance 
of the Wayne University data, 265 cases were 
used for one computation and only 99 for the 
second computation and in the Drake Uni- 
versity study the number of cases was 363 in 
one instance and 294 in another. While the 
author assumed that no bias occurred in re- 


ducing N, the possibility that bias did occur 
cannot be ruled out. 

A second limitation grows out of the fact 
that “in most instances” the Orientation Test 
was administered a year later than the ACE. 
One might assume that learning which took 
place during this year had no effect on the 
correlation between Orientation Test scores 
and accounting grades. One would also have 
to consider the possibility that the year of 
learning influenced either or both test scores 
and grades and consequently affected the cor- 
relation between the two. If the year of 
learning involved any accounting, one is con- 
fronted with an interesting effort to predict 
aptitude for learning that which has already 
been learned. 

A third limitation is that inherent in mak- 
ing generalizations about “the relative su- 
periority of the AIA Orientation Test when 
compared with tests of general scholastic 
ability . . .” on the basis of studies limited 
to comparison of the Orientation Test and a 
single scholastic ability test, namely the ACE. 

Possibly in an effort to keep his note brief, 
Jacobs has failed to mention the possibilities 
for more accurate prediction through the use 
of a number of predictors rather than a single 
predictor. Well-trained counselors seldom de- 
pend upon a single predictor. An increasing 
number of counseling agencies are construct- 
ing prediction equations based upon multiple 
variables. The question of whether the Ori- 
entation Test contributes significantly to such 
equations still has to be answered. 

It is entirely possible that studies in which 
the above listed limitations are not operative 
would provide proof that the AIA Orientation 
Test is superior to tests of general scholastic 
ability. Until such studies are cited, one 
would seem justified in retaining an open 
mind on the subject. 


Received November 4, 1953. 
Published out-of-turn by the editor. 





The Journal of Applied Psychology

Vol. 38, No. 1, 1954 


The Relation of Ninth Grade Test Scores to Twelfth Grade Test 
Scores and High School Rank 


Wilbur L. Layton 
Student Counseling Bureau, University of Minnesota 


The 9th grade is a crucial one for most 
students, for they, their school counselors, 
teachers and administrators must make deci- 
sions which are important for the students’ 
high school careers and in fact for their en- 
tire futures. High school guidance workers 
test many 9th grade students in order to as- 
sist them to select appropriate high school 
curricula. 

This study was an attempt to determine 
the meaning of 9th grade tests as predictors 
of over-all high school achievement and 12th 
grade test scores. 

In January and February of 1949, the 1947 
High School Edition of the ACE Psychologi- 
cal Examination was administered to approxi- 
mately 15,000 ninth grade students in Min- 
nesota through the state-wide high school 
testing program administered by the Student 
Counseling Bureau of the University of Min- 
nesota. The students tested in this program 
were from schools volunteering to participate 
in the program at their own expense. These 
schools consisted of approximately 50 per 
cent of non-metropolitan high schools in Min- 
nesota. Approximately 10,000 ninth graders 
were also given the Cooperative English Test, 
Form Y, Lower Level, Single Booklet Edition, 
Mechanics of Expression, Effectiveness of Expression
and Reading Comprehension. Three


Table 1

N’s, Means and Standard Deviations for 9th Grade Test
Scores and 12th Grade Test Scores and
High School Percentile Rank

                                     Standard
                   N       Mean      Deviation
9th ACE          2,173     67.9        18.4
12th ACE         2,185     94.7        24.0
9th English        690    155.6        64.4
12th English     2,185    172.7        42.2
12th HSR         2,185     50.8        28.7





years later, in the winter of 1952, all the high 
school seniors in the state, including many of 
the 9th grade students tested in 1949, were 
tested on the 1947 College Edition of the
ACE Psychological Examination and Coop- 
erative English Test, Form S, Lower Level, 
Mechanics of Expression and Effectiveness 
of Expression. High school percentile ranks 
(HSR) were procured from the high schools 
for these seniors. The HSR was based on 
the senior’s scholastic rank in his class at the 
end of three and one-half years of work. 

A sample of 2,185 men and women who had 
been tested as freshmen was pulled from the 
files. Correlations were computed between 
9th grade total ACE raw score, 9th grade


Table 2


Coefficients of Correlation between 9th Grade Test Scores and 12th Grade Test Scores and
High School Percentile Rank *


                             ACE           Coop. Eng.         HSR
Tests                   (12th Grade)      (12th Grade)    (12th Grade)

ACE (9th Grade)           .80(2169)        .71(2171)       .63(2173)
English (9th Grade)       .75( 681)        .82( 683)       .71( 690)
ACE (12th Grade)                           .74(2185)       .65(2185)
English (12th Grade)                                       .74(2185)


* In parentheses following the coefficient is given the number of cases upon which each coefficient is based.
total Cooperative English raw score and 12th
grade total ACE raw score, 12th grade
Cooperative English total raw score and HSR.
Table 1 presents the means and standard 
deviations for each of the variables. 

As Table 2 shows, there was a substantial 
relationship between the 9th grade tests and 
the corresponding tests given in the 12th 
grade and with HSR. High School ACE 


taken in the 9th grade correlated .80 with 
College ACE taken in the 12th grade and .63 
with HSR. These results indicate the extent 
to which the high school counselor can inter- 
pret 9th grade test scores as predicting high 
school achievement and 12th grade test scores 
and can use these predictions to counsel 9th 
grade students. 
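
The figures reported above are enough to sketch how a 9th grade score might be turned into a predicted 12th grade score. This is an illustration only, not a procedure given in the article. It assumes a linear (least-squares) relation, and it treats the Table 1 means and standard deviations and the Table 2 coefficient of .80 as if they came from one and the same set of cases, although the N's differ slightly. The function names are hypothetical.

```python
import math

# Values taken from Tables 1 and 2 of the article (9th grade ACE vs. 12th grade ACE).
R_ACE = 0.80
MEAN_9TH, SD_9TH = 67.9, 18.4
MEAN_12TH, SD_12TH = 94.7, 24.0


def predict_12th_ace(ninth_ace, r=R_ACE, mean_x=MEAN_9TH, sd_x=SD_9TH,
                     mean_y=MEAN_12TH, sd_y=SD_12TH):
    """Least-squares prediction of a 12th grade raw score from a 9th grade raw score."""
    slope = r * sd_y / sd_x  # regression slope under the linearity assumption
    return mean_y + slope * (ninth_ace - mean_x)


def standard_error_of_estimate(r=R_ACE, sd_y=SD_12TH):
    """Typical size of the prediction error, in 12th grade raw-score units."""
    return sd_y * math.sqrt(1.0 - r * r)
```

Under these assumptions a student exactly at the 9th grade mean is predicted to score at the 12th grade mean, and the standard error of estimate works out to about 14 raw-score points, which gives a sense of how much latitude remains around any single prediction.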


Received February 27, 1953. 
The use of vocational interest measures
assumes that workers in a given occupation have
in common certain likes and dislikes, and that
these preferences are different from those of
workers in other occupations. The extent to 
which an individual’s interest patterns match 
those of a group is determined by use of a 
scoring key on an interest inventory. This 
key is developed by using those responses 
which are made more frequently by the
specific occupational group than by
men-in-general (scoring these responses “plus”) and
those responses made less frequently by the 
specific occupational group (scoring these re- 
sponses “minus”). How great the difference 
in response must be in order for a response to 
be scored is a difficult question to answer. 
The difference must be large enough to reduce 
to a negligible amount the number of chance 
differences. Yet the number of responses 
scored must not be so small as to yield a key 
which is too unreliable for use with individ- 
uals. Between these two limits it is still possi- 
ble to develop many different keys possessing 
rather widely varying characteristics. 

This paper summarizes work which has 
been done in trying out various methods for 
the development of scoring keys for the U. S.
Navy Vocational Interest Inventory. The
reader will note that this work is strictly em- 
pirical, although the ideas which are tried 
out arise from theoretical work. To the ex- 
tent that interest inventory responses are 
unique in their psychometric characteristics, 
the findings of this report are limited in ap- 
plication. It seems reasonable to assume, 
however, that similar methods of key de- 
velopment would produce similar results when 
applied to such related measures as
personality inventories, biographical records, and
the like.

1 The research reported herein was carried out
under Contract N6ori-212, T.O. III, NR 151-248,
between the Office of Naval Research and the
University of Minnesota, and, in part, under a grant
from the Graduate School of the University of
Minnesota. Able assistance in major parts of the work
reported here was given by Mrs. Carolyn C. White
and Mr. Norris Ellertson of the project staff.


Samples Used 


Two occupational groups have been used, 
one civilian and one military. The civilian 
group is composed of 189 electricians obtained 
through labor union sources in St. Paul, Min- 
nesota, and, for cross-validation use, 174 elec- 
tricians similarly obtained in Minneapolis. 
Keys were developed by comparing their re- 
sponses with those of members of other oc- 
cupational groups from St. Paul and Min- 
neapolis. These were: milk wagon drivers, 
painters, plasterers, bakers, sheet metal work- 
ers, printers, warehousemen, plumbers, ma- 
chinists, shipping clerks, pressmen. 

The Navy group is composed of a sample 
of 261 Aviation Machinist’s Mates (AD’s) 
obtained through Receiving Stations on the 
east and west coasts, and a sample of 292 
AD’s for cross-validation purposes obtained 
from the Naval Air Technical Training Com- 
mand at Memphis. The Navy men-in-gen- 
eral sample used to determine the amount of 
overlap obtained for various keys is a sample 
of 200 men drawn randomly from a sample 
of 1,000 Navy rated men who had been drawn 
from the total Receiving Station sample in 
such a way as to reflect the distribution of 
rates in the Navy as a whole. The entire 
sample of 1,000 was used to obtain the 
percentages of responses of men-in-general 
needed in the development of keys.


Criteria of a “Good” Key 


For purposes of this study, a scoring key
is considered good if it does a good job of
separating workers in a given occupation from
workers-in-general. Thus, a key for Gunner’s
Mates would perform its function well if the
distribution of scores of GM’s was markedly
different from a distribution of scores of men
in another rate, or of men in a variety of
different rates. In the following pages, the
index of separation of such distributions
which shall be used is “percentage overlap.”
This index gives the number of persons per 
hundred in one distribution whose scores can 
be matched by scores in the other distribu- 
tion. Perfect separation occurs when the 
highest score in one distribution is lower than 
the lowest score in the other; in this instance 
the percentage overlap is zero. No separation 
at all can be made if the two distributions are 
identical. When this occurs the percentage 
of overlap is 100.²
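
As a rough illustration of how such an index behaves, the sketch below computes a percentage overlap for two groups from their means and standard deviations. It is not taken from the article or from Tilton's tables. It assumes both score distributions are normal and share a common standard deviation, approximated here by the average of the two, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percent_overlap(mean_a, sd_a, mean_b, sd_b):
    """Area common to two normal curves, as a percentage (equal-SD assumption)."""
    avg_sd = (sd_a + sd_b) / 2.0          # assumed common standard deviation
    d = abs(mean_a - mean_b) / avg_sd     # separation of means in SD units
    return 100.0 * 2.0 * normal_cdf(-d / 2.0)
```

Identical distributions give 100 per cent overlap, and the value falls toward zero as the means move apart, which matches the two limiting cases described above.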

A second criterion used in the evaluation of 
a scoring key is its reliability. In this report 
reliability is reported as test-retest reliability, 
obtained by scoring the interest inventories of 
90 men students at Dunwoody Industrial In- 
stitute, Minneapolis, who took the inventory 
twice, with an interval of about one month be- 
tween administrations. 

A different sort of criterion which may be 
used to evaluate methods of scoring the in- 
terest inventory would be the relative success 
of various keys for the prediction of school 
success or of advancement in rate, or the pre- 
diction of re-enlistment, or the prediction of 
military failure as evidenced by records of 
disciplinary action or less than honorable dis- 
charge. These methods of evaluation are 
obviously more pertinent for the application 
of interest inventory scores, but require the 
passage of a considerable period of time after 
administration of the inventory, to permit the 
individual to have a chance to achieve or fail 
to achieve. Accordingly, these criteria are 
not used in this report. One might expect 
that keys which do a good job of separating 
groups would prove to be the same sorts of
keys that would prove useful for these other
purposes, but there is, as yet, insufficient
evidence to warrant this expectation in the
military service. Data have been collected,
however, which will give evidence on this point
after a sufficient interval of time has elapsed. 

In any development of scoring keys based 
upon empirical methods, there is always the 
possibility that differences between groups 


2 This is the index of overlap suggested by Tilton 
(Tilton, J. W. The measurement of overlapping. J. 
educ. Psychology, 1937, 28, 656-662). Tilton’s ar- 
ticle provides tables which may be entered using the 
difference in means for the two distributions divided 
by the average of their standard errors. For other 
characteristics of this index, see Tilton’s article. 


used to select responses for scoring are chance 
differences which, upon cross-validation, will 
tend to disappear. Accordingly, for each of 
the keys developed and reported upon in this 
report, a cross-validation sample has been 
used to determine the amount of regression 
to be expected. In addition, differences gen- 
erally have been required for scoring which 
are large enough to be well beyond the limits 
within which chance factors would be ex- 
pected to operate; this method of operating 
seemed desirable since each key is made up 
of only a small number of items selected from 
a total pool of 1140 item responses. 


Optimal Number of Items in a Scoring Key 


Finding no adequate rationale for determining
a priori the number of item responses
to score in developing an occupational key for 
the vocational interest inventory, attempts 
were made to make this determination em- 
pirically. This work was started with the 
hope that scoring could be done with less 
effort than is required with the Strong Voca- 
tional Interest Blank, which does a good job 
of separating out occupational groups, but at 
the expense of a weighting of many item 
responses to get a score. (Strong assigns 
weights varying from plus four to minus four 
to as many as five or six hundred of the twelve 
hundred possible responses to his blank.) 

The first work done to determine how best 
to develop a scoring key was done with the 
civilian electrician sample. A series of scoring 
keys was developed on the basis of the
difference in responses of the electrician and other
skilled trades groups, as follows: a 6% key
was developed by using all item responses
with differences in percentage responses of
electricians and tradesmen-in-general of six
per cent or more. In like manner, a 7% key,
an 8% key, a 9% key, and so on, were
developed. The series was stopped at a 26%
key, when only 21 items remained for scoring. 
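
The construction of such a series of keys can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual procedure: the percentages are assumed to be held in dictionaries keyed by a response identifier, and the names `build_unit_weight_key` and `score` are hypothetical.

```python
def build_unit_weight_key(criterion_pct, reference_pct, min_diff):
    """Return {response_id: +1 or -1} for responses differing by min_diff points or more.

    criterion_pct and reference_pct map each item response to the percentage
    of the occupational group and of the comparison group choosing it.
    """
    key = {}
    for response_id, crit in criterion_pct.items():
        diff = crit - reference_pct[response_id]
        if diff >= min_diff:
            key[response_id] = 1      # chosen more often by the occupation: "plus"
        elif diff <= -min_diff:
            key[response_id] = -1     # chosen less often by the occupation: "minus"
    return key


def score(marked_responses, key):
    """Unit-weight score: sum of +1/-1 over the responses a person marked."""
    return sum(key.get(response_id, 0) for response_id in marked_responses)
```

Calling `build_unit_weight_key` with `min_diff` set to 6, 7, 8 and so on would give a series of progressively shorter keys of the kind described above.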

The comparative merits of each one of 
these keys may be inferred from the data 
presented in Table 1. These data indicate 
the existence of an optimal point in key de- 
velopment, since greatest separation occurs 
neither at the end of the scale with the small- 
est number of items, nor at the end of the 


Table 1


Comparison of Various Electrician Scoring Keys in
Terms of Overlap and Several Estimates
of Reliability


                        Per Cent Overlap
         No. of                     Cross-        Test-
         Items      Original      Validation      Retest
Key      in Key      N=189          N=174       Reliability

6%                    51%            50%
7%                    49             52
8%                    49
9% through 26%   (entries illegible in this copy)


scale with the largest number of items. Keys
with smaller numbers of items, in general, are 
to be preferred. It seems safe to conclude 
that, as one starts with a small number of 
items, the addition of more items increases 
the differentiating power of the key only so
long as these items contribute more
uniqueness than error; as error increases, the
standard deviations of both the criterion and
men-in-general groups increase enough to offset the
additional increase in mean difference con- 
tributed by these items. 

With a small number of items, however, 
some attention needs to be given to problems 
of reliability. When the only estimates of 
reliability that were available were made by 
other than test-retest means, this problem 
seemed serious enough to warrant the sacri- 
fice of considerable validity in order to 
achieve minimum reliability. As Table 1 in- 
dicates, however, very little is lost in the way 
of test-retest reliability, by a radical
reduction in the number of items scored.


A check on the degree to which this
generalization about the number of items in a key
affecting the validity of the key holds has been
made as part of another study using the
Strong Vocational Interest Blank. In that
study the best key was the one with the small- 
est number of items scored (24 items). How- 
ever, these were responses of psychologists, 
with no sample of answer sheets for men-in- 
general, so that a different measure of good- 
ness-of-key than percentage overlap was used. 
No evidence on test-retest reliability of this 
set of keys was obtained. Even so, it would 
seem that unit weighting of a fairly small 
number of items is warranted for scoring of 
vocational inventory responses.


Effects of Weighting


While the data on which the decision to
use unit weights was based are fragmentary,
they indicate clearly that superior separation
of groups can be attained by use of such unit
weights. Thus, per cent overlap between
electricians and tradesmen-in-general was
37% with the best unit-weights key, and was 
53% with a key weighted according to the 
formula used by Strong. The same figures 
for printers were 40% and 57%, respectively. 
Scoring the Strong blank, using best unit- 
weights key, placed men-in-general 3.71 stand- 
ard deviations below the mean for psycholo- 
gists in the original sample, and 4.03 standard 
deviations below the mean for psychologists in 
the cross-validation sample. Using Strong’s 
method of weighting, men-in-general fell 3.23
standard deviations below the mean for
psychologists.³

These comparisons do not, of course, in- 
dicate that weighting would not improve 
separation of groups. In fact, the entire 
literature on multiple regression would sug- 
gest otherwise. What they do indicate is that 

3 These data were obtained from sub-samples of
responses of psychologists to the Strong Vocational
Interest Blank used by Kriedt in developing the 1948
Psychologists key. See: P. H. Kriedt, Vocational
interests of psychologists, J. appl. Psychol., 1949, 33,
482-488. Kriedt reports that, for the total sample
of 1048 psychologists, the means of professional
men-in-general and of psychologists are 3.25 standard
deviations apart, using the standard deviation of the
psychologist group as the unit of measurement (p.
484). Using identical computational methods, the
sub-sample above gives a value of 3.23, giving good
indication of its representativeness.
a simpler scoring system can separate groups
as well as does the more involved method 
used with the Strong Vocational Interest 
Blank. In the interest of economy of scoring, 
it thus seems profitable to use unit weights 
until such time as a real superiority of mul- 
tiple weights is demonstrated. 


Heterogeneity of Content of Keys 


The selection of item responses for scoring 
solely on the basis of the percentage differ- 
ence in response of a reference group and a 
criterion group will tend, presumably, to give
an over-representation of items reflecting
certain aspects of the interests of the criterion
group, and under-representation of other
aspects. Thus, in developing a key for
electricians, it might well be that 30 responses
indicating a man liked to splice wires, repair
circuits, and the like, might be scored, whereas
only one response indicating that a man
wanted to study in the area of mathematics,
electrical engineering, and physics might be
scored. Yet both of these kinds of responses
are characteristic of the responses of
electricians.

In a sense the use of weights might be
considered as an attack on this problem. Most
weighting is done, however, on the basis of the
magnitude of the difference between
men-in-general and the specific group, rather than on
the basis of the amount of the factor already
measured by other item responses. To devise
an economical procedure for computing such
weights directly would be a genuine
contribution. This project has not done so. In the
absence of such a procedure, approximation
methods must be employed.

The first method employed in this study to
improve the composition of a scoring key was
an attempt to avoid including in a key too
large a number of items reflecting the central
core of interests of an occupational group.
An iterative method of item selection was
therefore employed. First, the best ten items
were selected; these were the items on which
the responses of the criterion group differed
most from the responses of the reference
group. All members of the criterion group
were then scored for their responses to these
items. Another ten items were then selected;
for each of these the difference in responses
between reference and criterion groups was
still large, and the correlation with the com- 
posite of the first ten items was negligible. 
Another set of ten valid items (i.e., differen- 
tiating between criterion and reference groups) 
which did not correlate with these first twenty 
was then selected. Finally, ten more valid 
items unrelated to the first thirty were se- 
lected. This key is therefore a fairly hetero- 
geneous key which omits a rather large num- 
ber of items even though they differentiate 
members of the occupational group from 
tradesmen-in-general. 
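
The iterative selection just described can be sketched as below. This is a hedged illustration, not the project's actual computing routine. It assumes each item response is coded 0 or 1 for every member of the criterion group, that all items are keyed in the positive direction, and that "negligible" correlation means an absolute Pearson coefficient no larger than a chosen ceiling (`max_r`, a hypothetical parameter, as are the function names).

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def iterative_key(responses, validity, block=10, rounds=4, max_r=0.10,
                  min_validity=0.0):
    """Select items in blocks, each block uncorrelated with the items already chosen.

    responses: one list of 0/1 marks per criterion-group member.
    validity:  percentage difference (criterion minus reference) per item.
    """
    ranked = sorted(range(len(validity)), key=lambda i: validity[i], reverse=True)
    selected = ranked[:block]                 # the most valid items come first
    for _ in range(rounds - 1):
        # Score every person on the composite of the items chosen so far.
        composite = [sum(person[i] for i in selected) for person in responses]
        added = []
        for i in ranked:
            if validity[i] < min_validity:
                break                         # remaining items are not valid enough
            if i in selected:
                continue
            item_marks = [person[i] for person in responses]
            if abs(pearson(item_marks, composite)) <= max_r:
                added.append(i)
            if len(added) == block:
                break
        selected = selected + added
    return selected
```

With `block=10` and `rounds=4` the routine yields at most 40 items, the size of the key described above. Note that all items within a block are checked against the same composite, so items added in the same round may still correlate with one another.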

The first groups on which this type of key 
was tried were civilian electricians. The elec- 
trician key which had been developed by 
simpler means, taking all item responses with 
a given percentage difference for the criterion 
and reference group, was already a satisfac- 
tory key. The percentage overlap of distribu- 
tions of scores of electricians and tradesmen- 
in-general was only 35% in the original group, 
and 41% in the cross-validation group. 

Even so, the use of the iterative method for 
selecting items for scoring in a key reduced 
overlap to 30% in the original sample, and 
to 35% in the cross-validation sample. And 
this is done without any real drop in the re- 
liability of the key, even though only 40 item 
responses are scored. 

The same comparison of an original key 
(developed by using all items showing a given 
minimum difference between criterion and 
reference groups) and a key developed by 
iterative methods was made using samples 
of Aviation Machinist’s Mates (AD’s)
obtained from Navy sources. The AD key
developed by original methods is not a very
good key in terms of its separation of AD’s 
from Navy men-in-general, since the overlap 
of these two groups is relatively high—65% 
for the original group, and 58% for the cross- 
validation group. Its reliability is, however, 
rather good. On the other hand, the key de- 
veloped by iterative methods is a distinctly 
better key than that developed by original 
methods when one looks at the overlap be- 
tween groups, but has a reliability of only .74. 
These findings are in accord with those ob- 
tained with civilian electricians, except that 


differences are greater between different keys.


(The reader should not generalize from these 


Table 2

Summary of the Characteristics of Various Scoring Methods Applied to Three Criterion Samples


                                      Per Cent Overlap
                                                 Cross-       Test-Retest
Group             Type of Key      Original    Validation     Reliability

Electricians      Original
                  Iterative
                  Gulliksen

Rec. Sta. AD’s    Original
                  Iterative
                  Gulliksen


two groups and assume that Navy groups are 
consistently harder to separate—the AD 
group was selected because it is a group that 
gives relatively poor separation from other 
Navy groups, and hence provides a severe 
test of the value of the various methods tried 
with a better-separation group.) 

In hopes of developing still better keys at 
less cost for computation, another type of key 
was tried. The method of developing this 
key requires selection of a fairly sizable pool 
of items, perhaps 100 or more, by taking those 
items with high validities, and then eliminat- 
ing those with high indices of internal con- 
sistency and only moderate validity. This 
type of key has been labeled, for want of a 
better title, the “Gulliksen Key,” since the 
steps taken are similar to those proposed by 
Gulliksen.⁴ Specifically, a key is developed
by selecting all items for which the criterion 
group response differs from that of the refer- 
ence group by a given amount or more (gen- 
erally, 12 to 15 percentage points). A large 
(1,000 for Navy, 550 for civilian groups) 
men-in-general sample is then scored using 
this key. The top and bottom 27% of this 
distribution is used to obtain an estimate of 
the reliability of each item; the difference in 
responses of the criterion and reference groups 
is used as an estimate of the validity of the 
item. These two values are then plotted
against each other much in the manner de- 

4 Gulliksen, H. Theory of mental tests. New
York: John Wiley & Sons, Inc., 1950. See especially
pages 382-385.

5 Item reliability and validity indices when
expressed in correlation terms have yielded, in this
work, keys with almost identical characteristics as
those obtained using percentage differences.


35% 41% 88 
30 86 
28 86 


65% 85
51 74 
56 75 


scribed by Gulliksen (op. cit., p. 384) and 
items selected much as he recommends. The 
general effect of the method is to give prefer- 
ence to items which have good validity and 
which do not correlate highly with other items 
in the pool. 
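
The two item indices used in this method, and the elimination of items with high internal consistency but only moderate validity, can be sketched as follows. This is an approximation offered for illustration, not the project's routine and not Gulliksen's own procedure, which works from a plot. The cutoff values `high_reliability` and `high_validity` are hypothetical parameters, as are the function names.

```python
def upper_lower_index(total_scores, item_marks, tail=0.27):
    """Difference in per cent marking an item between top and bottom scorers.

    total_scores: key score of each person in the men-in-general sample.
    item_marks:   0/1 mark of each person on the item in question.
    """
    n = len(total_scores)
    k = max(1, round(n * tail))               # size of each 27% tail
    order = sorted(range(n), key=lambda p: total_scores[p])
    bottom, top = order[:k], order[-k:]
    pct_top = 100.0 * sum(item_marks[p] for p in top) / k
    pct_bottom = 100.0 * sum(item_marks[p] for p in bottom) / k
    return pct_top - pct_bottom


def gulliksen_style_select(items, high_reliability, high_validity):
    """Keep items unless they combine high internal consistency with modest validity.

    items: {item_id: (reliability_index, validity_index)}, both in percentage points.
    """
    return [item_id for item_id, (rel, val) in items.items()
            if not (rel >= high_reliability and val < high_validity)]
```

The validity index for each item is simply the criterion-minus-reference percentage difference already used to form the initial pool, so only the upper-lower index requires an additional scoring pass.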

It should be noted that this method is an- 
other approximation method, and is designed 
to select items having somewhat the same
characteristics as the items selected by the
iterative method. The Gulliksen method as 
here used is somewhat easier to employ, is 
more readily adapted to I.B.M. methods, and 
hence is more practical than the iterative 
method. It should also be noted that the 
values used as estimates of reliability and 
validity of items differ from those outlined 
in Gulliksen, since in this analysis gross per- 
centage differences are used in estimating 
these item characteristics. 

The comparison of overlaps and reliabili- 
ties of all of these new keys with the original 
keys developed for electricians and the Navy 
AD group is summarized in Table 2. In both 
instances, the Gulliksen key is distinctly su- 
perior to the original key in terms of overlap 
and is perhaps better than the iterative key. 
The superiority of both methods over the 
original key is retained in the cross-validation 
samples as well. In both the electrician and 
the Navy AD samples this gain seems large 
enough to warrant the use of the new key in 
spite of the fact that this key has a lower 
reliability than the original key. 

As noted above, a best unit-weights key 
for psychologists using the Strong blank re- 
sulted in superior separation of psychologists 
from men-in-general as compared with a key
weighted according to the formula used by
Strong. The Gulliksen method described
above was also applied to the Strong data 
but with a slight modification. Since no 
sample of answer sheets for men-in-general 
was available it was necessary to base
estimates of item reliability on the criterion
group. The top and bottom 27% of a sub- 
sample of 604 psychologists were accordingly 
used. A 95-item key resulted from applica- 
tion of the Gulliksen method which differed 
very little in its effectiveness from the best 
unit-weights key previously mentioned. Using 
the Gulliksen key placed men-in-general 3.78 
standard deviations below the mean for psy- 
chologists as compared with 3.71 for the best 
unit-weights key. On cross-validation the 
comparable figures were 3.79 for the Gullik- 
sen and 4.03 for the best unit-weights key. 
It is to be noted, however, that this best unit- 
weights key contained only 24 items, and 
while test-retest reliability is not available, 
it is doubtful if it would be found to be ade- 
quate. This key included all items on which 
psychologists and men-in-general differed by 
33% or more in their responses. The
Gulliksen key used items with as low as 18%
difference. On a best unit-weights key of 91 items
(including all items with 24% or greater dif- 
ference between psychologists and men-in- 
general), and in the sense of number of items 
more nearly comparable to the Gulliksen key, 
the men-in-general means were 3.26 and 3.42 
standard deviations below the means for psy- 
chologists on test and cross-validation groups. 
The implication is clear that item for item, 
the Gulliksen key results in superior separa- 
tion, but interpretation must be cautious since 
information on reliabilities of these keys is 
not available. 


Summary 


The development of a method of scoring 
responses to an interest inventory so as to 
maximize the separation of workers in an 
occupation from workers in general involves 
consideration of many factors. Taking a cue 
from applications of multiple regression
techniques, we would expect that a point would
be reached when the addition of more items
in a scoring key would not be profitable; that,
in general, the greater the heterogeneity of
item content, the more effective would be the
key; and that the use of weighting methods 
properly applied would increase the degree of 
separation of groups. Using as criteria of a 
good key its ability to separate groups (as 
measured by per cent of overlap of distribu- 
tions) and its test-retest reliability, it is 
theoretically possible to demonstrate the im- 
portance of each of these points. From a 
practical standpoint, however, one must de- 
termine whether or not approximation meth- 
ods are usable, and, if so, to what extent these 
various factors need to be considered when 
employing these approximation methods. 

This report summarizes various methods 
of developing keys, and provides support for 
the following statements: 


1. When items are scored using unit
weights, an optimum number of items can be 
found for scoring. For the samples used 
herein, this number seems to be between 40 
and 60; when either more or fewer items are 
scored, the discriminating power of the key 
is reduced. 

2. When item responses are weighted in the 
manner used by Strong in his Vocational In- 
terest Blank, the criterion group is not sepa- 
rated from the reference group as well as 
when unit scores using the optimum number 
of items are used. (This is not to say that 
some weighting system could not be devised 
which would be superior to unit scoring— 
obviously such a set of weights could be as- 
signed as to yield a score superior to any score 
by using multiple regression techniques. 
What this does say is that the method of 
weighting used by Strong is not superior to 
the method of unit weights.) 

3. When items are selected so as to in- 
crease the heterogeneity of content of a 
scoring key, the validity of that key is in- 
creased, and the test-retest reliability is some- 
what decreased. This is true whether items 
for such a key are selected by an iterative 
method as described in this report, or by an 
internal item analysis method. 


Received March 23, 1953. 
Clerical Workers *


It is possible that the popularity of a
reliable and valid psychological test may
become a weakness of that test at least in a
given locality. There is some indication that 
such is the case with the Minnesota Voca- 
tional Test for Clerical Workers. In the se- 
lection of clerical workers this test has been 
a valuable aid to many business firms in Min- 
neapolis-St. Paul, Minnesota and elsewhere. 
Its extensive use in the above mentioned Min- 
nesota cities has resulted in many job appli- 
cants having taken the test several times. 
If the test is subject to “practice effects,”
then the scores made by applicants who have
taken it more than once become of
questionable value.

Since the Minnesota Vocational Test for 
Clerical Workers has been so widely used, 
the question has been raised as to what effect 
practice has upon the scores. Earlier studies 
indicated a normal practice effect of from 7 
to 12 per cent after time intervals of from 
three to six months (1). This may not seem 
a prohibitive effect for the time intervals in- 
volved, but in actual employment practice 
much shorter intervals of time are probably 
the rule. An applicant may apply for a job
with several different companies within a mat- 
ter of hours or days. 

The purpose of this study was to measure 
practice effect on this test over relatively 
short intervals of time. Two groups of Uni- 
versity of Minnesota students in personnel 
psychology courses served as subjects. Group 
A was made up of 61 juniors, seniors and 
graduate students (41 men and 20 women). 
Group B was comprised of 36 Extension Di- 
vision students (24 men and 12 women) in 
an evening class. Group A was given the 
test successively on a Wednesday, Friday, 
and Monday, October 1, 3, and 6, 1952. 

* This research was made possible by a
grant-in-aid from the Graduate School of the
University of Minnesota.


Group B, which met only once a week, was 
given the test on three successive Monday 
evenings September 29, October 6 and 13, 
1952. The purpose of the study was ex- 
plained to both groups and they were en- 
couraged to make as much improvement as 
possible. Since the number of subjects in 
each group was small and the time intervals 
between testing were not great, results for the 
day and night groups are combined. 

Table 1 presents the results of the com- 
bined groups. It is apparent that consider- 
able practice effects occur. All of the differ- 
ences between the means are significant at 
the .1 per cent level. When considered from 
the standpoint of what these differences mean 
in terms of centile ranks we observe that the 
mean scores on the original testing would 
have had centile ranks on norms for em- 
ployed clerical workers below 50 while the 
centile ranks of the mean performance when 
the test was taken the third time, range from 
72 to 91 on the same norms. 
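
The significance of a difference between two means obtained from the same subjects depends on the correlation between the two testings. The sketch below shows one standard way of computing such a t from summary figures of the kind given in Table 1. Whether the authors used exactly this form (for instance, N rather than N - 1 in the denominator) is an assumption, and the function name is hypothetical.

```python
import math


def correlated_t(mean_diff, sd_1, sd_2, r, n):
    """t for the difference between two correlated means, from summary statistics."""
    # Variance of the difference scores, allowing for the correlation between trials.
    var_diff = sd_1 ** 2 + sd_2 ** 2 - 2.0 * r * sd_1 * sd_2
    standard_error = math.sqrt(var_diff / n)
    return mean_diff / standard_error
```

For example, a mean gain of 28.1 with standard deviations of 30.1 and 25.7, a correlation of .93 between trials, and 32 subjects gives a t of about 14. The high correlations between trials are what make even modest gains so clearly significant.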

A different type of analysis, presented in 
Table 2, shows much the same thing as the
data in Table 1. On trial 3 from 91 to 97 
per cent of the subjects reach or exceed the 
mean score made on trial 1, indicating marked 
improvement. 

As has been shown elsewhere (1, 2, 3, 4, 
5) there is a decided sex difference in per- 
formance on this test. The subjects in this 
study behave similarly, as shown in Table 3. 
Comparing the results of men and women on 
trial 1 it is apparent that the women are 
superior to men. It is also obvious that this 
difference is consistent on successive trials on 
the test, i.e., comparing trial 1 for men with 
trial 1 for women, trial 2 for men with trial 2 
for women, and trial 3 for men with trial 3 
for women. When successive trials for men
are compared with the original trial for
women, the practice effect rather rapidly
overcomes the original differences, and by trial 3,





Practice Effects on Test for Clerical Workers 


Table 1 


Means, Standard Deviations, Differences between Means, t’s, P’s, r’s and Centile Rank the Means 
Would Have on Norms for Employed Clerical Workers 








Part A. Combined Male Groups, N = 65 

                          Numbers                      Names 
Trials               1        2        3          1        2        3 

M                  118.6    142.5    152.5      124.0    154.7 
S                   26.7     25.9     25.4       29.3     24.4 
Centile rank of 
mean scores          27       62       72         47       81 

Trials compared     1-2      1-3      2-3        1-2      1-3      2-3 

D                   23.9     33.9     10.0       30.7     43.3 
t                   12.0     14.7      7.7       18.1     15.5 
P                   .001     .001     .001       .001     .001 
r                    .81      .74      .91        .89      .67 




Part B. Combined Female Groups, N = 32 

                          Numbers                      Names 
Trials               1        2        3          1        2        3 

M                           157.9    170.2      142.7    170.8 
S                            26.3     22.7       30.1     25.7 
Centile rank of 
mean scores          40       70       83         36       68 

Trials compared     1-2      1-3      2-3        1-2      1-3      2-3 

D                            33.2     12.3       28.1     36.1 
t                            11.5      4.9       14.1     10.3 
P                            .001     .001       .001     .001 
r                             .82      .84        .93      .77 





77 per cent (Numbers) and 80 per cent 
(Names) of men scored as high as did the 
women on trial 1. This is additional evi- 
dence of the seriousness of the practice effect 
on this test. 


Discussion 
The practice effect found on the Minne- 
sota Vocational Test for Clerical Workers 
can be explained in part by the nature of the 
test itself. First, the changed digits in the 


Table 2 


Percentage of Women and Men (Combined Day and Extension Groups) Who Reach or Exceed the Mean 
on the First Trial, or the Second Trial or Subsequent Trials * 










Numbers 





Per Cent 
of Women 


Per Cent 
of Men 


Per Cent 
of Women 


Per Cent 
of Men 





80 85 87 
94 91 97 
80 66 69 


89 Trial 2 vs. trial 1. 
97 Trial 3 vs. trial 1. 
69 Trial 3 vs. trial 2. 





* Line one of Table 2 shows percentage reaching or exceeding on trial 2 their own mean on trial 1. 
Line two shows same data for trial 3 compared to trial 1. 
Line three shows same data for trial 3 compared with trial 2. 
Table 3 


Percentage of Men (Combined Day and Extension 
Groups) Who Reach or Exceed the Mean of the 
Women (Combined Day and Extension Groups) 
on the Various Trials of Taking the Test * 








Names 
Per Cent 
of Men 


Numbers 
Per Cent 
of Men 


32 Trial 1 vs. trial 1. 
55 Trial 2 vs. trial 1. 
77 Trial 3 vs. trial 1. 
H Trial 2 vs. trial 2. 
40 Trial 3 vs. trial 2. 
30 Trial 3 vs. trial 3. 








* Line one of the table shows the percentage of men 
who on trial one reach or exceed the mean of women on 
trial one. Line two indicates the percentage of men 
who on trial two reach or exceed the mean of women on 
trial one. And line three gives the percentage of men 
who on trial three reach or exceed the mean of women 
on trial one. The remaining lines of the table present 
similar comparisons for trial two with trial two, trial three 
with trial two, and trial three with trial three. 


number test all occur near the end of the 
second series of digits. If one catches on to 
this one can materially improve one’s score. 
Secondly, the items that are changed or are 
not changed tend to fall into patterns which 
may help a subject with good visual imagery 
on repeated trials on the test. Thirdly, on 
the names part of the test memory may be 
an important factor in bringing about im- 
provement on successive trials, because the 
subjects may be able to remember a fairly 
large number of the name pairs that are 
changed. 

Practice effect is not a new phenomenon; it 
has been found on other psychological tests 
besides the Minnesota Vocational Test for 
Clerical Workers. Wherever found, it is 
likely to be a weakness in any test that is to 
be used in selecting employees. Especially is 
this true when it is difficult or nearly impos- 
sible to determine how many times a job ap- 
plicant has taken the test previously. If one 
could determine accurately how many times 
a subject had taken the test and how long a 
time interval had transpired between test- 
ings, correction factors could be worked out. 
But in the everyday world of employee selec- 
tion and placement, no reliable method of se- 
curing such information exists. Therefore, 
other ways of overcoming practice effects 
must be provided if a test subject to practice 
effect is to have maximum value. The use of 
alternate forms is one way to reduce this 
weakness in a test. 


Summary 


1. When the Minnesota Vocational Test 
for Clerical Workers is taken successively 
with short time intervals intervening, marked 
practice effects occur. 

2. With equal amounts of practice the sex 
difference on test performance remains about 
constant but with three practice trials men 
can practically equal the original perform- 
ance by women. 

3. Alternate forms of the test may over- 
come these weaknesses in the test. 


Received March 23, 1953. 
The interrelationship between attitudes and 
motivation has already been noted by a num- 
ber of observers (2, 6, 8, 10). In recent 
years, evidence of the pertinence of the atti- 
tude construct to behavioral criteria has been 
demonstrated best perhaps in the two-volume 
series entitled The American Soldier (9). 
Although much of the research reported in 
these volumes was concerned with service- 
induced attitudes, large segments of the work 
dealt with persisting attitudes derived from 
the serviceman’s reference groups external to 
the service. As a generalization, it might be 
said that in these studies attitudes were found 
to be functionally significant in determining 
the individual soldier’s orientation to military 
life and, accordingly, to his motivation (9, 
pp. 122-130). 


Problem 


This study set forth to determine whether 
certain attitudes which a Naval Aviation 
Cadet brings with him to the training pro- 
gram bear a relationship to his level of moti- 
vation in training. It is apparent, of course, 
that attitudes may be ordered in a hierarchy 
relative to their significance to this particular 
training situation. That is to say, one would 
hardly consider that just any attitudes would 
have significant relevance to motivation in 
this setting; on the other hand, it is apparent 
that attitudes toward study or discipline or 
flying may be of the utmost relevance. In 
evaluative fashion, then, one might arrive at 
a grouping of attitudes which are presumed 
to be of significance in relationship to the 
motivation of cadets in training. With this 


1 Opinions or conclusions contained in this report 
are those of the authors. They are not to be con- 
strued as necessarily reflecting the view or the en- 
dorsement of the Navy Department. 

2 The authors wish to acknowledge their indebted- 
ness to Dr. Brant Clark for his valuable assistance 
in the formulation of this report and to Dr. Richard 
Trumbull, Miss Marjorie Nicholson, and Mr. Calvin 
Nelson who acted as independent coders of the data. 
in mind, it was considered that the area of 
interpersonal attitudes would provide a fruit- 
ful area for study. In particular, it was de- 
cided that attitudes toward authority-figures, 
in this case officer-instructors (flight and 
ground school), would be an appropriate be- 
ginning. The intent of the study was to de- 
rive implications for further investigations as 
well as to determine possible applications to 
selection. The basic hypothesis asserted for 
test was as follows: that attitudes toward 
authority-figures would significantly differen- 
tiate between cadets of “high” and “low” 
motivation. 


Procedure 


The measurement of attitudes, like the 
measurement of all psychological variables, 
offers challenging and oftentimes unique prob- 
lems. This is especially so where the attitude 
under scrutiny is both structurally complex 
and emotionally laden, in this case attitudes 
toward authority-figures. It soon became ap- 
parent that the traditional attitude scale was 
inadequate and inappropriate to the meas- 
urement of an attitude such as this. As a 
consequence, this technique was discarded in 
favor of the more flexible open-ended projec- 
tive questionnaire (7). 


The usefulness of this method of attitude- 
elicitation rests on the fact that it presents the 
individual with a relatively unstructured stimulus 
situation in which he may, with equanimity, and 
without being consciously aware of the process, 
bring forth feelings that might normally be re- 
pressed through social pressures and other forces. 
Thus, by the employment of this technique, the 
cadet who felt resentment toward an instructor 
might vent his feelings without fear of retribu- 
tion or guilt. The advantage of such a pro- 
cedure in a military setting is obvious. 

In its final form the questionnaire resembled 
superficially the form developed by Flanagan in 
his studies of “critical incidents” among Air 
Force personnel (3). That this was merely a 
resemblance should be re-emphasized, lest an 
erroneous impression be conveyed. The main 
intent of the investigation was to procure infor- 
mation about instructors only insofar as this in- 
formation revealed the attitudes of the cadet 
group under study. The format of the question- 
naire was essentially simple. It was presented 
to the subjects under conditions of anonymity 
with the inference that only information was be- 
ing solicited. In addition, subjects were spe- 
cifically asked not to divulge the names of the 
individuals about whom they were to write. The 
cover sheet of this questionnaire contained these 
instructions: “On each of the following pages 
you will be asked to write briefly about a person 
you have known while in the Naval Air Training 
Program. The instructions indicate that you are 
to relate just one incident which typified the atti- 
tudes and behavior which have led you to make 
a positive or negative judgment about this per- 
son. The incident, however, does not have to 
be the only one of its kind, nor must it have 
been the main basis for your evaluation of this 
person.” 

On the top of page one the following further 
instructions were given: “Think of the best in- 
structor you had during Pre-Flight or Flight 
Training. Give just one incident which typified 
the kind of attitudes and behavior which made 
you feel that he was the best. What were the 
specific details of his behavior in that particular 
situation?” 

On the top of page two these instructions were 
given: “Now think of the worst instructor you 
had during Pre-Flight or Flight Training. Here 
again, give just one incident which typified the 
kind of attitudes and behavior which made you 
feel that he was the worst. What were the spe- 
cific details of his behavior in that particular 
situation?” 


Methodologically, two points deserve clari- 
fication: first, the “best”-“worst” dichotomy 
was utilized in an effort to secure a degree of 
polarization of response which would readily 
yield to differential analysis; second, the in- 
strument was administered under rigorous 
conditions of anonymity so as to minimize 
any implied threat. 

For purposes of this investigation, motiva- 
tion was defined operationally. Cadets of 
“high” motivation were considered to be 
those who had successfully completed the 
basic flight stage of the Naval Air Training 
Program.3 Cadets of “low” motivation were 
those who voluntarily withdrew from the pro- 
gram during this stage. 

3 The program is divided into three major phases: 
Pre-Flight, Basic Flight, and Advanced Flight. In 
virtually all cases, cadets who have completed Basic 
have been in training for one year or more. By this 
time attrition is minimal and the likelihood of suc- 
cess is very high. 

During a three-month period in the fall 
of 1951, the questionnaire was administered 
to a total sample of 137 cadets classified as 
follows: 72 cadets who were leaving training 
at their own request (the “low” motivation 
group) and 65 cadets who had successfully 
completed basic flight training (the “high” 
motivation group). In both instances ad- 
ministration of the questionnaire was part of 
a routine check-out procedure and was usu- 
ally carried on with small groups numbering 
five or less. 

A summary comparison of the two cri- 
terion groups will be found in Table 1. With 
respect to age and active duty time before 
entering training they were quite compar- 
able. On the whole, however, cadets drop- 
ping at their own request tended to have a 
significantly greater amount of formal educa- 
tion prior to training. This latter finding 
corroborates, in part, certain of the results 
growing out of a previous report from this 
command (1). 

Following the administration procedures, 
responses to the questionnaire form were ab- 
stracted so as to yield only core phraseology 
relevant to the instructor’s behavior and the 
cadet’s reaction to this behavior. These ab- 
stracts were thereupon transcribed on 3 x 5 
cards and assigned code numbers at random 
so as to eliminate insofar as was possible 
subjective bias in the content analysis pro- 
cedure which followed. Thus, at no time 
during the categorization of this data did the 
judges know the disposition of the cadet 
whose response was in hand. 

As a next step, all of the responses to the 
two instructional “sets,” that is, “best” and 
“worst” instructor, were sifted to secure de- 
scriptive elements of behavior. From these, 
a number of categories of behavior were de- 
veloped, which subsumed behavioral elements 
of similar quality and as much as possible 
used the language of the respondents rather 
than that of the investigators. In every in- 
stance, these categories were developed inde- 
pendently of one another in terms of an 
either-or criterion. That is, either the be- 
havior was described in the response or it 
Table 1 


Summary Comparison of Motivation Criterion Groups with Regard to Age, Previous Education, 
and Previous Military Service 

                                                             Previous Active 
                            Age           Education*         Military Duty 
                          (Years)     (College Semesters)      (Months) 
Group                   Mean   S.D.      Mean    S.D.         Mean    S.D. 

High Motivation 
(Successful) N = 65     22.0              3.3     2.6         21.8    13.2 
Low Motivation 
(Withdrawal) N = 72     22.2   1.4        5.5     2.5         20.6    14.0 

* The t test of significance for the difference between the means for the college education variable was found 
to be 4.89. This is significant at the 1% level of confidence. 


was not. Thus, overlap was possible and, 
indeed, very frequently took place—but only 
as a result of the respondent’s having men- 
tioned more than one major behavioral ele- 
ment. 

As a check on reliability of judgment, 
three independent coders were asked to dis- 
criminate one category of behavior, for each 
of the two instructional sets, within the total 
population of responses to that set. Percent- 
ages of agreement with the principal investi- 
gators and the three independent coders were 
computed for the response categories selected 
for the reliability check. All were found to 
reach an acceptable level.4 


Results 


Frequency of response under each major 
category for the two cadet groups was sub- 
jected to a chi-square analysis. Table 2 pre- 
sents the findings of this procedure comparing 
the major categories of “best” and “worst” 
instructor behavior for the successful and 
withdrawal cadet responses. In general, Ta- 
ble 2 reveals that the cadets of high motiva- 
tion tend to manifest attitudes toward the 
interpersonal quality of instructor behavior 
while those of low motivation, on the other 
hand, tend to show attitudes directed at the 
instructor’s success or failure in his role as a 
teacher. Close scrutiny of this table indi- 
cates that under the “best” instructor set, 


4 Two response categories were checked. The per- 
centages of agreement for the response category 
patience were .96, .87, and .88 for each of the inde- 
pendent coders; for verbal assault the percentages 
were .95, .94, and .93, respectively. 


cadets of the “high” motivation group re- 
sponded with significantly greater frequency 
within the categories of personal interest and 
patience than did the cadets in the “low” 
motivation group. On the other hand, the 
“low” group, under this same set, responded 
with significantly greater frequency than did 
the “high” group within the categories good 
instructional techniques and extra help. Un- 
der the “worst” instructor set, the “high” 
motivation group reacted with significantly 
greater frequency than the “low” group 
within the category verbal assault and with 
significantly less frequency than the “low” 
group within the category poor instructional 
techniques. No significant difference between 
the groups was found within the indifference 
category, under this set. 
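
The chi-square comparisons described above can be illustrated with a minimal sketch. It assumes the ordinary Pearson chi-square for a 2 x 2 table without continuity correction; the paper does not state which form was used. The cell counts are not given in the source and are reconstructed here from the rounded percentages for the verbal assault category under the “worst” instructor set (61 per cent of 62 “high” cadets, 21 per cent of 70 “low” cadets), so they are an assumption. The function name is illustrative only.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Group sizes responding to the "worst" instructor set (from Table 2).
high_n, low_n = 62, 70

# ASSUMPTION: counts recovered by rounding the reported percentages.
high_yes = round(0.61 * high_n)  # cadets in the "high" group mentioning verbal assault
low_yes = round(0.21 * low_n)    # cadets in the "low" group mentioning verbal assault

chi2 = chi_square_2x2(high_yes, high_n - high_yes, low_yes, low_n - low_yes)
print(f"chi-square = {chi2:.2f}")
```

Under these assumptions the result lands close to the tabled value of 21.74 for this category. With one degree of freedom, a chi-square above 10.83 corresponds to P < .001.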


Discussion 


The results indicate that differences of atti- 
tude toward authority-figures do exist between 
cadets of “high” and “low” motivation. The 
hypothesis, therefore, was substantiated. Spe- 
cifically, it would appear that there is a de- 
gree of variation in identification with in- 
structors between cadets of the two criterion 
groups. Indeed, it may be that this process 
of identification may account for the differ- 
ences obtained. 

While it was initially considered that the 
attitudes studied here were brought by the 
cadets to the training program, one might 
properly question the actual temporal rela- 
tionships involved. That is, were the atti- 
tudes toward these authority-figures brought 
Table 2 


Chi-Square Analysis and Significance Levels between the Motivation Criterion Groups 
for the Major Response Categories 


“Best” Instructor 

                                            Per Cent* 
                                        High        Low 
                                        Group       Group 
Response Category                       (N = 65)    (N = )      χ² 

Showed Personal Interest                  75          44       12.91 
Indicated Patience                        43          20        8.08 
Used Good Instructional Techniques        37          63        9.65 
Gave Extra Help                            9          26        6.49 


“Worst” Instructor 

                                            Per Cent* 
                                        High        Low 
                                        Group       Group 
Response Category                       (N = 62)    (N = 70)    χ²        P 

Manifested Verbal Assault                 61          21       21.74    <.001 
Used Poor Instructional Techniques        18          40        7.83    <.01 
Indicated Indifference                                37        1.22    >.30 


* The N’s given here for the groups represent the actual number of people in the criterion groups who re- 
sponded to the “set.” 


to the training situation, or were they condi- 
tioned mainly by experiences in training? A 
research project has recently been completed 
(5) which essentially duplicated the current 
investigation in order to provide an answer 
to this question. In this study, cadets just 
entering training were given a similar ques- 
tionnaire form in which they were asked to 
give parallel information on previously en- 
countered authority-figures, that is, high 
school or college instructors. The results of 
this study indicate quite conclusively that 
attitudes of the cadets who subsequently 
withdrew from training were similar to those 
of the “low” motivation group of the present 
study. This group tended to describe the 
skill or lack of skill of their high school or 
college instructor in his role as a teacher at 
a significantly higher level than did the ca- 
dets who remained in training. Thus, it ap- 
pears that attitudes toward authority-figures 
are among the attitudes persistently held by 
the cadets, and are related to their level of 
motivation in the Naval Air Training Pro- 
gram. A number of related investigations 


designed to articulate this relationship still 
further are now being conducted. On the 
whole, it would seem that this attitude-elici- 
tation technique bears further scrutiny as a 
possible device for the assessment of motiva- 
tion in a number of settings. 




Summary 


This paper reports on attitudes toward au- 
thority-figures which discriminated between 
Naval Aviation Cadets of “high” and “low” 


motivation. The “high” motivation group 
consisted of 65 cadets who had successfully 
completed Basic Flight Training, and the 
“low” group consisted of 72 cadets who were 
withdrawing from training voluntarily. Both 
groups were required to complete anony- 
mously an open-ended questionnaire form 
which required them to describe a sample 
of behavior characteristic of their “best” 
and “worst” officer-instructors (ground and 
flight). Content analyses were undertaken 
and frequencies for each content category 
were determined for both groups. The re- 
sults revealed that cadets of “high” motiva- 
tion tended to manifest attitudes concerning 
interpersonal relationships with their officer- 
instructors while the “low” group stressed 
competence of the instructor in his role as a 
teacher. Interpretations were suggested with 
respect to cadet identification with authority- 
figures as a motivational factor in this setting. 


Received April 23, 1953. 
Readability of Employee’s Letters in Relation to Occupational 
Level 

In any form of written communication it is 
obviously of great importance to have infor- 
mation concerning the reading ability of the 
audience and to use that information in the 
communication process. Since 1948 (10), a 
large number of articles have appeared in the 
psychological literature and elsewhere stress- 
ing the need to simplify and make more read- 
able the communications which managements 
direct to their employees. (See the bibliog- 
raphy by Hotchkiss and Paterson [8].) 
These articles have suggested the use of read- 
ability formulas (most popularly those pre- 
sented by Flesch) as one means of control- 
ling the level of the communication and thus 
attaining the goal of better understanding. 
Many writers, following the lead of Flesch 
(5, 6) have used educational achievement as 
a base from which the comprehension ability 


of the audience may be estimated. Other 
writers have contributed the results of read- 
ing comprehension tests given to selected sam- 
ples of special audiences. Here, for example, 
one finds the work by Bellows and Palmer 
(1) and Colby and Tiffin (2) on the reading 


levels of foremen and supervisors. For the 
most part, however, some indirect estimation 
procedure must be used since the results of 
applying reading comprehension tests to rank- 
and-file employees in industry are not avail- 
able. 

This paper has a two-fold purpose: first, to 
advance tentatively another estimation pro- 
cedure and, second, to consider in its own 
right the data revealed by this technique. It 
was hypothesized that the readability level of 
employee-written communications should re- 
flect the effective literacy level of the em- 
ployees. It was further hypothesized that 
literacy level increases (as does education and 
intelligence) with higher occupational levels. 
This would mean, then, that the readability 
difficulty of employee-written letters as meas- 
ured by the Flesch formula should increase 




as occupational level increases. Briefly, it 
was our belief that in general the complexity 
of one’s writing provides an indirect index to 
the complexity of material which one can 
readily comprehend and that, since it is gen- 
erally agreed that reading ability increases 
with occupational level, complexity of writ- 
ing will increase also. 


Method 


A total of 400 employee-written letters 
were made available from the General Mo- 
tors “My Job Contest” (Evans and Laseau 
[3]).1 These letters were randomly drawn 
from a 10 per cent sample of the 174,854 let- 
ters received in this contest. While these let- 
ters are not “typical” writing samples from 
the employees, they are letters written under 
standard stimulus conditions and hence are 
uniquely comparable. 

Average sentence length, syllable counts, 
and Flesch Reading Ease scores were deter- 
mined for each of the letters on the basis of 
a 100-word sample from each letter. In 67 
instances the letters contained less than 100 
words, so the RE scores were determined by 
prorating these on the basis of the total 
words available in that letter. The average 
length of these prorated letters was 71 
words. All counting was done independently 
of salary level information. 
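
The scoring step described above can be sketched as follows. The sketch assumes the standard Flesch (1948) Reading Ease formula, RE = 206.835 - .846 wl - 1.015 sl, where wl is the number of syllables per 100 words and sl is the average sentence length in words. Prorating a letter of fewer than 100 words is taken here to mean scaling its syllable count up to a 100-word basis, which is an interpretation of the text rather than a documented procedure. The function name and the example counts are hypothetical.

```python
def reading_ease(words, sentences, syllables):
    """Flesch Reading Ease for a sample of `words` words.

    The syllable count is prorated to a 100-word basis, so samples shorter
    than 100 words are handled the same way as full samples.
    """
    sentence_length = words / sentences          # sl: average words per sentence
    syllables_per_100 = 100.0 * syllables / words  # wl: syllables per 100 words
    return 206.835 - 0.846 * syllables_per_100 - 1.015 * sentence_length


# Hypothetical full 100-word sample: 5 sentences, 150 syllables.
full_sample = reading_ease(words=100, sentences=5, syllables=150)

# Hypothetical short letter of 71 words: 4 sentences, 100 syllables.
short_letter = reading_ease(words=71, sentences=4, syllables=100)

print(round(full_sample, 1), round(short_letter, 1))
```

Higher scores indicate easier material, which is why the hypothesis predicts lower RE scores at higher occupational levels.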

It is to be noted in connection with this 
analysis that the determination of average 
sentence length for use in the Flesch RE 
formula is done on the basis of separate 
ideas, independent of punctuation (which 
was of dubious accuracy at best in these let- 
ters). This admittedly could introduce a 
source of error since a change of one sentence 


1 The writers wish to express their appreciation for 
the cooperation of the Employee Research Section, 
General Motors Corporation and especially to Dr. 
Chester E. Evans. The 400 employee letters and 
the occupational descriptions used in this article 
were furnished by that organization. 
in the 100-word sample changes the average 
sentence length and the RE score markedly. 
However, the reliability of the Flesch RE 
measures has been shown to be quite satis- 
factory (7). 

Following the determination of the RE 
scores, letters were then classified by occu- 
pational level of their writers. There were 
two major groupings, the salaried and the 
hourly employees. 

The salaried group included the “skilled 
group with responsibilities added,” the 
“skilled group,” and the “partially skilled 
group.” Originally the salaried group in- 
cluded “learners” but this category was elimi- 
nated because of the small number of cases. 
The salary group was generally defined as 
follows: 


“Sub-managerial and clerical occupations 
involving supervising, coordinating, guiding, 
and performance of general clerical work. 
Primarily concerned with preparation, tran- 
scription, systematizing and filing of oral and 
written communications in offices, shops, and 
other places where such functions are per- 
formed.” 

The group of hourly employees was di- 
vided in accordance with the traditional clas- 
sification into “skilled,” “semiskilled,” and 
“unskilled.” These were defined as follows: 

“Skilled: Includes craft and manual occu- 
pations that require predominantly a thor- 
ough and comprehensive knowledge of proc- 
esses involved in the work, exercise of con- 
siderable independent judgment, usually a 
high degree of manual dexterity, and, in some 
instances, extensive responsibility for prod- 
ucts and equipment. Employees in these oc- 
cupations often become qualified through ap- 
prenticeship or extensive training periods.” 

“Semiskilled: The exercise of manipula- 
tive ability of a high order within a fairly 
well-defined work sequence. The major re- 
liance, not so much upon the employee’s 
judgment or dexterity, but vigilance and 
alertness, in situations in which lapses in 
performance would damage equipment or 
product. These occupations may require the 
limited performance of part of a craft or 
skilled occupation.” 


“Unskilled: Manual occupation involving 
performance of simple duties which can be 
learned in a short period of time. Little or 
no independent judgment is required and such 
occupations require no similar job experi- 
ence.” 

Some letters were dropped from the sam- 
ple at each stage of the analysis. In all, 26 
cases were discarded leaving a total of 374 
letters for final analysis. As stated before, 
letters by “learners” were discarded. Occu- 
pational classifications were not available or 
were in doubt for several of the letters. 


Results 


Mean and standard deviation of RE scores 
were calculated for each of the occupational 
groups. These are presented in Table 1. 
Analysis of variance applied to the means of 
these groups yielded an F value of 10.61 
which is, of course, significant far beyond 
the .01 level. 

An inspection of Table 1 reveals that a 
clear hierarchy of mean RE scores is evident 
not only between the major groups but 
within them as well. The means for the 
“skilled” salaried people place them at 
Flesch’s “Fairly Difficult” level, typical of a 
quality magazine, indicating reading achieve- 
ment levels from 10th to 12th grade and re- 
quiring some high school for understanding. 
The “partially skilled” salaried employees 
and “skilled” hourly employees write at a 
level equivalent to the digests, “Stand- 
ard,” indicating reading achievement within 
the 8th and 9th grade levels which requires 
completion of 7th or 8th grade for un- 


Table 1 


Reading Ease Scores of Employee Letters 

Occupational 
Classification                        N       Mean     S.D. 

Salaried Employees: 
  Skilled with Responsibilities               53.7 
  Skilled                                     53.6 
  Partially Skilled                           61.7 

Hourly Employees: 
  Skilled                                     64.0 
  Semiskilled                                 69.1 
  Unskilled                                   72.9 
Table 2 
Percentage of Employees in Each Occupational Group Writing at Each Reading Ease Level 








Flesch Reading Ease Levels 

Occupational       VD      D       FD      S       FE      E       VE 
Classification    0-29   30-49   50-59   60-69   70-79   80-89   90-100   Total 





Salaried : 
Skilled with Responsibilities 
Skilled 
Partially Skilled 


44.4 
41.2 
19.0 


27.8 
17.6 
23.8 


22.2 
17.6 
23.8 


5.6 
23.5 
28.6 


100.0 
100.0 


All Salaried 
Hourly: 
Skilled 
Semiskilled 
Unskilled 
All Hourly 


All Employees 





derstanding according to Flesch. The “semi- 
skilled” and “unskilled” hourly workers write 
at a mean level which is like slick fiction, 
“Fairly Easy,” indicating a reading achieve- 
ment of 7th grade and requiring completion 
of 6th grade for understanding. 

More revealing is the tabulation of the per- 
centages of persons of each occupational level 
who wrote at each of Flesch’s readability lev- 
els. These data are summarized in Table 2. 

Here again the progression of reading ease 
scores over the occupational level hierarchy 
is striking. For example, it may be seen that 
44 per cent of the “skilled with responsibili- 
ties” salaried group write at the “Difficult” 
level while only 2 per cent of the unskilled 
(hourly) group write at this same level. 

100.0 

23.2 100.0 

21.4 19.6 
25.5 
13.3 
14.3 

15.4 

25.5 
18.3 
18.4 
19.5 

25.5 3.9 
5.0 
8.2 

53 

100.0 
100.9 
100.0 
100.0 

16.6 19.8 

4.5 

100.0 

To further facilitate use of these data, the 
percentages of Table 2 were cumulated for 
each occupational level. The results are pre- 
sented in Table 3. 
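
The cumulation can be illustrated with a short sketch using the percentages for the “skilled with responsibilities” salaried group. The assignment of the four values to the D, FD, S, and FE levels is an assumption: the text confirms only that 44 per cent of this group wrote at the “Difficult” level, and the remaining columns are inferred from the running totals in Table 3.

```python
from itertools import accumulate

# ASSUMPTION: column labels inferred; only the "D" value is confirmed in the text.
levels = ["D (30-49)", "FD (50-59)", "S (60-69)", "FE (70-79)"]
skilled_with_responsibilities = [44.4, 27.8, 22.2, 5.6]

# Running total from the most difficult level toward the easiest,
# rounded to one decimal place as in the published tables.
cumulative = [round(total, 1) for total in accumulate(skilled_with_responsibilities)]

for level, pct in zip(levels, cumulative):
    print(f"{level:<12} {pct:5.1f}")
```

Read this way, each cumulative figure gives the share of the group writing at that level of difficulty or harder.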


Discussion 
The data, as presented, are of descriptive 
interest just as they stand. However, a cru- 
cial question remains. Since these letters are 


samples of employees’ writing, does this really 
indicate the reading comprehension level of 


Table 3 
Cumulative Percentage of Employees in Each Occupational Group Writing at Each Reading Ease Level 








Flesch Reading Ease Levels 





Occupational       VD      D       FD      S       FE 
Classification    0-29   30-49   50-59   60-69   70-79 





Salaried : 
Skilled with Responsibilities 
Skilled 
Partially Skilled 
All Salaried 
Hourly : 
Skilled 
Semiskilled 
Unskilled 
All Hourly 


Total 


V 


44.4 
41.2 
19.0 
33.9 


72.2 
58.8 
42.8 
57.1 


94.4 100 
100 
95.2 


98.1 


13.8 
12.4 


2.0 
11.0 


39.3 
28.7 
16.3 
26.4 


90.3 
74.7 


71.4 
76.7 
14.4 


31.0 79.9 










these same employees? There is no rigorous 
answer to this question at the present time. 
A consideration of the writing process as op- 
posed to the reading process, production of a 
word as opposed to its recognition, the spe- 
cial conditions of a contest with very sub- 
stantial prizes, the special pressures on the 
individual to make some kind of an entry so 
his group might receive a participation award, 
and the possibility that many of the letters 
were written with help from members of 
one’s family, neighbors, etc., preclude any re- 
alistic discussion of whether an individual 
writes at a level higher than, lower than, or 
similar to his reading comprehension level. 

Some supporting evidences, however, incline 
the writers to the view that this is representa- 
tive writing and that it is indicative of mini- 
mal reading skill. First, repeated analyses of 
house organs (presumably written by salaried 
employees who are “skilled with responsibili- 
ties”) show their mean level to be very close 
to that indicated in this study. In general, 
their writing averages RE scores of about 50 
(4, 11) as compared to the average of 54 obtained
in this study. Second, the study by
Bellows and Palmer (1) of reading comprehension
of foremen (who are presumably like
the “skilled” hourly worker) seems to match
very closely the data obtained in this study
for this group. Their data are presented in
modified form in Table 4 for comparison with 
the group from this study. Colby and Tiffin 
(2) find the median reading grade for factory 
supervisors to be the 10th grade level while 
this study shows a median in the 9th grade 
level. 

If one accepts the data from these letters 
as reflecting the minimal effective literacy 
level of the employees, then this industrial 
audience has been somewhat more clearly 
delineated. To reach 95 per cent of all employees,
for example, Table 3 indicates it
would be necessary to write at the “Easy”
level of 80 to 90 (pulp fiction). This is the 
level which Flesch has predicted would reach 
91 per cent of the adult population. 

On the other hand, if one were concerned 
only with reaching the top salary-level group 
represented here (skilled with responsibilities 
added) 94 per cent of that group would find 




Table 4


Reading Comprehension Grade of Foremen as Measured
by Bellows and Palmer in Comparison with
Estimated Comprehension Level of
Skilled Sample in this Study


Reading Comprehension    Per Cent of Foremen      Per Cent of Skilled
Grade Level              (Bellows and Palmer)     (Estimated from RE Scores)
                         N = 100                  N = 51


16+ 8 4 2.0 
13-16, 21 11.8 
10-12 * 26 25.5 
a 27 25.5 
7 6 25.5 

6 x 5.9 
4-5 8 3.9 





the “Standard” level within their reading 
comprehension. This would be a Reading 
Ease level of 60 to 70 and is typical of digest 
magazines. 

Writing at the “Fairly Easy” level (typical 
of slick fiction; Reading Ease 70 to 80) 
would be easily understood by 80 per cent of 
all employees. It would be well within the 
grasp of almost all salaried employees. However,
only 71 per cent of the unskilled employees
would readily comprehend this “Fairly
Easy” level of writing.

It is interesting that this standard of RE
of 70 or easier was recommended by Paterson
and Walker (11), Farr, Paterson, and Stone
(4), and by Lauer and Paterson (9) in their 
studies of industrial communications intended 
for “rank-and-file employees.” 


Summary 


A total of 400 employee letters were ran- 
domly drawn from the 10 per cent sample of 
letters received in the General Motors “My 
Job Contest.” One 100-word sample from 
each letter was analyzed by the Flesch Read- 
ing Ease formula. The letters were then 
sorted by occupational level of the writer. 
The mean RE score and the standard devia- 
tion were computed for each of six occupa- 
tional levels. Mean differences between the 
groups were highly significant. 
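
For readers who want to reproduce the scoring step, the following is a minimal sketch of the Flesch Reading Ease computation from raw counts. The constants are those of the published 1948 formula; the function names and the sample counts are illustrative assumptions, and the counting of syllables and sentences (done by hand in a study of this kind) is left outside the sketch.

```python
def reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease score from raw counts for one text sample."""
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# (exclusive upper bound, label) for each Reading Ease band.
BANDS = [
    (30, "Very Difficult"),
    (50, "Difficult"),
    (60, "Fairly Difficult"),
    (70, "Standard"),
    (80, "Fairly Easy"),
    (90, "Easy"),
]


def reading_ease_level(score: float) -> str:
    """Map a Reading Ease score to its descriptive band."""
    for upper, label in BANDS:
        if score < upper:
            return label
    return "Very Easy"


# Hypothetical 100-word sample: 6 sentences, 140 syllables.
sample_score = reading_ease(words=100, sentences=6, syllables=140)
print(round(sample_score, 1), reading_ease_level(sample_score))
```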

A hierarchy of mean RE scores was found 
to exist ranging from a mean of 54 (Fairly
Difficult) for the “skilled” salary groups to
a mean of 73 (Fairly Easy) for the “un- 
skilled” hourly employees. A table showing 
the percentage of each group writing at each 
RE level was prepared to more fully describe 
the distributions. Some evidence suggesting 
that the writing was representative and indicative
of comprehension level was presented.
The results were interpreted as confirming 
previous readability studies of industrial com- 
munications and as providing a guide for the 
preparation of industrial communications. 


Received April 2, 1953. 
Sam L. Witryol
Department of Psychology, The University of Connecticut 


The primary purpose of this paper is to 
present an experimental comparison of three
scaling approaches to the method of paired
comparisons: Thurstone’s Case III and Case
V, and Guilford’s Short Cut. Recent literature
pertaining to a variety of related practical
and theoretical developments will also
be briefly reviewed. 

There appears to be a resurgence of inter- 
est on the part of investigators in many areas 
of psychology in the applicability of the 
method of paired comparisons to practical 
scaling problems. Fortunately some impor- 
tant original contributions, clarifying meas- 
urement problems and re-examining basic as- 
sumptions, have also recently been published. 
The work of Mosteller (21, 22, 23, 24) is 
exemplary and constitutes, in the opinion of 
the writer, the most brilliant rational discus- 
sion of paired-comparison scaling features 
since Thurstone’s early developments (26, 27, 
28). 

In a previous investigation (30), the writer 
employed Thurstone’s Case V for scaling 
paired-comparison data on teacher-generated 
motivational values in the classroom. The 
Case V scale values from four different class- 
room groups were compared with the values 
from the same samples scaled by Case III. 
The correlations between these scale values 
obtained by these two methods from four
sets of data were essentially unity, despite 
the fact that one of the assumptions for Case 
V (equal discriminal dispersions) appeared 
to have been violated. In the present in- 
vestigation the values obtained from these 
data by means of Thurstone’s Case III and 
Case V approaches were compared with those 
obtained by Guilford’s Short-Cut method 
(10). The Guilford method was rejected by 
the writer in the earlier study because it 
seemed to lack a defensible rationale for the 
discriminal unit. This decision will be re- 
examined here in the light of experimental 
findings from the present study and from related
researches.


Gulliksen (11) has discussed the broad 
scaling characteristics and power of the 
method of paired comparisons, and Burros 
(4) has pointed out that this “psychophysi- 
cal” procedure has special value for scaling 
stimuli when the “physical” correlates are 
not easily discernible. Coombs (5) has 
noted that the data in most scaling experi- 
ments are qualitative in nature. With these 
considerations in mind, it is worthwhile to 
evaluate the method of paired comparisons 
in terms of generality of measurement to 
various types of qualitative data and also in 
terms of economy of application.


Previous Research Findings 


In the present writer’s earlier investigation
(30), the scaling rationale developed by
Thurstone for Case III and for Case V, and
the rationale developed by Guilford for his
Short-Cut method were described in some
detail. The major relevant features will be
briefly reviewed here. Thurstone (26) examined
various assumptions under five special
cases for the application of the method
of paired comparisons to scaling the psychophysical
law of comparative judgment. The
choice of these Thurstone scaling methods
generally hinges upon the selection of either
Case III or Case V, based upon whether the
stimulus dispersions are approximately equal
or unequal. Thurstone admitted that these
dispersions could never be directly observed
(27), and later developed a statistical procedure
for approximating the measurement
of the “ambiguity of each stimulus” (28).
The important practical consideration was
the fact that Case V, a much simpler scaling
device, was indicated by Thurstone to be applicable
if the stimulus dispersions were approximately
equal.
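
The distinction between the two cases can be stated compactly. The notation below is the conventional one for the law of comparative judgment and is supplied here as an aid; it does not appear in the text above. $S_i$ and $S_j$ are the scale values of two stimuli, $\sigma_i$ and $\sigma_j$ their discriminal dispersions, $r_{ij}$ the correlation between the discriminal processes, and $z_{ij}$ the normal deviate corresponding to the proportion of judgments in which stimulus $i$ is chosen over stimulus $j$.

```latex
% General law of comparative judgment
S_i - S_j = z_{ij}\,\sqrt{\sigma_i^{2} + \sigma_j^{2} - 2\,r_{ij}\,\sigma_i\,\sigma_j}

% Case III: zero correlation, unequal dispersions
S_i - S_j = z_{ij}\,\sqrt{\sigma_i^{2} + \sigma_j^{2}}

% Case V: zero correlation, equal dispersions (\sigma_i = \sigma_j = \sigma)
S_i - S_j = z_{ij}\,\sigma\,\sqrt{2}
```

Under Case V the factor $\sigma\sqrt{2}$ is a constant and may be taken as the unit of the scale, which is why that case requires so much less computation.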

As a laborsaving alternative to Case V, a
Short-Cut method was devised by Guilford
(8). The results with this procedure yielded
very high correlations with the results obtained
from Thurstone’s Case V, but Guilford
was unable immediately to present an
adequate mathematical and psychological
justification. Later, he attempted this latter 
task, but he could not develop a unit for the 
psychological scale (9). 

Recent empirical findings have been sug- 
gestive. Edwards (6) reported a very high 
correlation, approaching unity, between values 
obtained with Case III and Case V, but did 
not indicate the values for the stimulus dispersions.
Koch¹ (17) also reported very
high correlations (about + .99) between 
scale values obtained with Case V and Guil- 
ford’s Short Cut. Satter (25) found Guil- 
ford’s approach to be a highly reliable method 
for job evaluation. These findings are in gen- 
eral consistent with the research experience 
(30, 31, 32) of the writer as well as with 
the experimental findings to be reported in 
the present investigation. 

However, an adequate rationale is not 
readily ascertainable from a review of the 
empirical findings, and one must turn to the 
recent brilliant efforts of Mosteller (21, 22, 
23, 24) for the most productive and provocative
leads. Mosteller presents a careful
mathematical rationale which makes it pos- 
sible to relax restrictions previously consid- 
ered basic assumptions for the application of 
Case V. Thus he has demonstrated that: 


1. An assumption of equal correlations, as 
well as one of zero correlations, between the
stimulus pairs is tenable (21). 

2. An aberrant stimulus standard devia- 
tion affects only the position of that stimulus 
involved (22). 

3. If the aberrant stimulus dispersion is 
near the center of the scale, scale positions 
of the other stimuli will not be seriously af- 
fected (22). 

4. The requirement of normality in the 
original distribution is not necessary (24). 

Furthermore, Mosteller (23) proposed a 
test of goodness of fit of observed to theo- 
retical proportions; this method is also de- 
signed to test unidimensionality. 

A recent rational development by Burros
(4) is noteworthy. He worked out a method
for estimating stimulus dispersions. His results
compare favorably with Thurstone’s as
valid estimates, and they have the advantage
of requiring less arithmetical computation,
although Burros’ formulae are more complicated.

¹ Obtained in part by personal communication
from Dr. Helen L. Koch, University of Chicago, Oct.
4, 1950.

The problem of unidimensionality of the 
paired-comparison scale has received serious 
consideration. Most investigators have ap- 
plied Thurstone’s methods to data assumed 
to be ordered along a single dimension. How- 
ever, Gulliksen’s excellent analysis (11) dem- 
onstrated the feasibility of the application of 
the method of paired comparisons to multi- 
dimensional scales. In fact, he reasoned that 
this power of the paired-comparison method 
was a significant advantage over ordinary 
ordinal scales, and he reviewed researches 
which were exemplary of these possibilities. 
In any event, determination of unidimension- 
ality or multidimensionality is an important 
factor in a specific experimental situation. 

Mosteller (23), as noted above, has de- 
veloped a chi square test for unidimension- 
ality with the restrictive assumption relaxed 
to equal in addition to zero correlations be- 
tween the stimulus pairs. Kendall and Smith 
(14, 15) derived a “coefficient of agreement” 
(also “coefficient of consistence”) to test the 
assumption of linearity of the paired-com- 
parison variate under consideration. Johnson 
(13) described this test, and, recently, Balinsky,
Blum, and Dutka (3) demonstrated its
applicability in determining the consistency 
of product preferences. Finally, an experi- 
mentally provocative and potentially fruitful 
approach to multidimensional variates was 
suggested by Andrews (1). He performed a 
factor analysis of the multidimensional ele- 
ments in stimuli presented in paired-compari- 
son form; his analysis was derived from the 
table of proportions conventionally calcu- 
lated as part of the computational process. 


Experimental Procedure 


The paired-comparison data analyzed in 
this experiment were obtained in an earlier 
investigation (30) where the methodological 
details were fully described; the main fea- 
tures will be briefly reviewed. The stimuli 
consisted of a group of ten praiseworthy and 
of another group of ten blameworthy categories
derived from teacher-generated motivational
values as reported by school children.
Each of these two groups of ten
stimuli was presented in paired-comparison
form to 1,120 school children in grades 6-12.
The subject’s task was to judge which of each 
pair of stimuli was more teacher-approved or 
disapproved. Case V scale values were calcu- 
lated from the responses to these stimuli for 
each sex by age-grade classification, so that 
each sample population included 80 subjects. 
Thus, a total of 28 sets of scale values, with 
ten stimuli in each, were computed. 
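
As an aid to readers unfamiliar with the computation, a minimal sketch of Case V scaling follows. It assumes the usual complete-data solution, in which each proportion is converted to a normal deviate and the deviates are averaged by column; the function name and the three-stimulus matrix of proportions are hypothetical and are not data from the investigation.

```python
from statistics import NormalDist


def case_v_scale_values(prefer):
    """Case V scale values from a square matrix of choice proportions.

    prefer[i][j] is the proportion of judges choosing stimulus j over
    stimulus i. Proportions of exactly 0 or 1 have no finite normal
    deviate and are not handled in this sketch.
    """
    n = len(prefer)
    to_deviate = NormalDist().inv_cdf
    z = [
        [0.0 if i == j else to_deviate(prefer[i][j]) for j in range(n)]
        for i in range(n)
    ]
    # Scale value of stimulus j: mean of the deviates in column j.
    return [sum(z[i][j] for i in range(n)) / n for j in range(n)]


# Hypothetical proportions for three stimuli (illustrative only).
proportions = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.5],
]
scale_values = case_v_scale_values(proportions)
print([round(value, 3) for value in scale_values])
```

Because complementary proportions yield deviates of opposite sign, the scale values sum to zero, so the origin of the scale is arbitrary and only the distances between stimuli carry meaning.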

For purposes of the present experiment four 
sample sets from the above data were selected 
for comparative analyses by means of three 
different scaling procedures: Thurstone’s Case 
V and Case III, and Guilford’s Short Cut. 
The sets were selected from the total population
in such a manner as to represent both
sexes, both experimental conditions (praise 
and blame) and, finally, different age-grade 
levels. Each sample represented a particular 
sex, experimental condition, and age-grade 
level. The specific nature of the sample sets 
can be readily observed from the captions of 
the tables and figures in the results, below. 


Results 


The scale values obtained by each of the
three scaling approaches to paired comparisons
are presented in Tables 1, 2, 3, and 4.
The discriminal dispersions of each of the
stimuli, as estimated by Thurstone’s Case III,
are shown in the last column of each table.
Twelve product-moment correlations obtained
by comparing the scaling results calculated
by the three different approaches range from
.987 to .999; these intercorrelations appear
in the bottom three rows of the four tables.
The averages of the four intercorrelations obtained
by comparing Case V with the Short
Cut, Case III with Case V, and Case III with
the Short Cut in all the samples are .998,
.994, and .991, respectively.
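
The comparison of two sets of scale values by a product-moment correlation can be sketched as follows. The two lists of scale values are hypothetical stand-ins, not figures from Tables 1 to 4, and the function name is an illustrative assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scale values for the same five stimuli under two methods.
case_iii_values = [-1.20, -0.55, -0.10, 0.40, 1.45]
case_v_values = [-1.05, -0.50, -0.05, 0.38, 1.22]

r = pearson_r(case_iii_values, case_v_values)
print(round(r, 3))
```

A correlation near unity shows only that the two methods order and space the stimuli in nearly the same way; it is unaffected by a difference in the spread of the two scales, such as the smaller standard deviations noted for Case V.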

It should be noted from the tables that the
discriminal dispersions in the last columns
are not approximately equal. It can be seen
by inspection that there is a considerable
range in these estimated dispersions in each
of the four samples. Finally, the standard
deviations of the scale values obtained by
each of the three approaches are systematically
smaller in Case V and in the Short Cut,
respectively, than in Case III.


Table 1


“Teacher Praise” Scale Values Computed by
Thurstone’s Case III and Case V and
Guilford’s Short-Cut Methods
(80 Sixth-Grade Boys)


Table 2


“Teacher Praise” Scale Values Computed by
Thurstone’s Case III and Case V and
Guilford’s Short-Cut Methods
(80 Twelfth-Grade Girls)


Behavior-Activities    Discriminal Dispersions
(Stimulus)             (Estimated by Case III)

Honest                 1.318
Polite                  .786
Industrious            1.127
Attention               .595
Independent            1.019
Cooperative            1.351
Obey                    .728
Talking                 .823
Clean                  1.274
Help                    .980



















Table 3 


“Teacher Scold” Scale Values Computed by
Thurstone’s Case III and Case V and
Guilford’s Short-Cut Methods
(80 Eighth-Grade Girls)


Behavior-Activities    Short    Discriminal Dispersions
(Stimulus)             Cut      (Estimated by Case III)


Rude 1.06 685 
Dishonest 1.02 1.346 
Disobey my 93 925 
Disturb j 772 551 
Chew Gum é 770 1.281 
Fight 3: 69 1.184 
Poor Work 2 f 63 589 
Attention ; ; 54 910 
Talking j J 50 1.781 
Untidy 0 747 
a : ‘ 292 


Trit-v 
Till se 


\ 1V—se 


These quantitative results are graphically
represented in Figures 1, 2, 3, and 4.


Fig. 1. “Teacher praise” scale values computed
by Thurstone’s Case III and Case V and Guilford’s
Short-Cut methods (80 sixth-grade boys).


Table 4


“Teacher Scold” Scale Values Computed by
Thurstone’s Case III and Case V and
Guilford’s Short-Cut Methods
(80 Tenth-Grade Boys)


Behavior-Activities    Short    Discriminal Dispersions
(Stimulus)             Cut      (Estimated by Case III)


Discussion


The empirical comparisons in this experiment
suggest the following conclusions:

1. The Case V approach appears to yield
essentially the same scale distribution as the
Case III method for the stimuli employed in
this experiment. This is true despite the fact
that one assumption for Case V, approximate
equality of the estimated discriminal dispersions,
is grossly violated in each of the four
samples.

2. Guilford’s Short-Cut approach appears 
to yield essentially the same scale distribution 
as both the Case V and Case III methods for 
ordering the stimuli employed in this study. 
This is true despite the frequent observations 
in the literature that Guilford was unable to 
indicate a unit for his psychological scale. 

In the opinion of the writer, these conclu- 
sions, taken in conjunction with the empirical 
findings of other investigators, and considered 
from the standpoint of contemporary rational 
developments, suggest a number of practical 
and theoretical implications. One possibility 
regarding the violation of the assumption of 
equal discriminal dispersions for Case V is 
indicated from Mosteller’s work (22). He 
has reasoned that if an aberrant stimulus 
(i.e., dissimilar in discriminal dispersion) is 
near the center of the scale, there will not be 
much effect upon the ordering of the stimuli 
along the scale by means of Case V. This
explanation provides a relaxation of the restriction
that the largest discriminal dispersion
should be no larger than twice the smallest
dispersion for the employment of Case V
(10). The most seriously aberrant stimuli
in the data in the present investigation are for
the most part the smaller ones, and they tend
to fall near the middle of the scales in the
four samples. It should also be kept in mind
that Thurstone himself admitted that his calculated
dispersions were estimates (28) and
contained large probable errors.


Fig. 2. “Teacher praise” scale values computed
by Thurstone’s Case III and Case V and Guilford’s
Short-Cut methods (80 twelfth-grade girls).
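
The two-to-one restriction on discriminal dispersions discussed above is easy to check for any one sample. The sketch below uses the ten dispersions listed in the last column of Table 2; four of those entries lost their decimal points in reproduction (786, 595, 823, 980) and are read here as .786, .595, .823, and .980, which is an assumption.

```python
# Discriminal dispersions from the last column of Table 2
# (decimal points restored where missing; see the note above).
dispersions = [1.318, 0.786, 1.127, 0.595, 1.019,
               1.351, 0.728, 0.823, 1.274, 0.980]

largest, smallest = max(dispersions), min(dispersions)
ratio = largest / smallest

# Rule of thumb for Case V: largest dispersion at most twice the smallest.
meets_restriction = ratio <= 2.0

print(f"largest = {largest}, smallest = {smallest}, ratio = {ratio:.2f}")
print("restriction met" if meets_restriction else "restriction violated")
```

On this reading the largest dispersion is somewhat more than twice the smallest, which is consistent with the statement that the equal-dispersion assumption was not met in these samples.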

An important practical consideration
emerges from these possibilities. If Mosteller’s
rational efforts combined with the empirical
findings in this study point toward an
increasing generality of the applicability of
Case V, then the labor of calculations will be
greatly reduced, as compared to Case III, and
these conditions might then stimulate more
widespread use of a very valuable, powerful,
and somewhat neglected tool in psychological
measurement, namely the method of paired
comparisons. As a matter of fact, the classical
reference to this tool as “psychophysical”
is somewhat misleading since Thurstone has
emphasized that (29, p. 142), “Although the
law of comparative judgment is easily applied
to the stimuli of classical psychophysics, the
more generally interesting applications are
those which involve social, moral, and esthetic
values, opinion polls, and consumer preferences.”
More recently, the method of paired
comparisons has been exploited in such diverse
areas as sociometry (17, 31, 32), industry
(18, 19, 25), social motivation (30),
and learning theory (12, 33).

Guilford’s Short-Cut method provides an
even more economical approach than Thurstone’s
Case V. The shortcoming of Guilford’s
approach is the lack of an adequately
defined psychological unit. Yet, it appears
to “work,” as demonstrated in the empirical
findings reported in the present study. Perhaps
a possible rationale for this approach
might be found in Coombs’ (5) “ordered
metric” scaling concept, although he feels
that the greater power of the method of
paired comparisons is wasted, since the
method of rank orders can be easily employed
for his “psychological scaling without
a unit of measurement.” It is of interest to
note here that Edwards (6, 7) has demonstrated
the method of successive intervals to
be an economical alternative to the method
of paired comparisons.


Fig. 3. “Teacher scold” scale values computed by
Thurstone’s Case III and Case V and Guilford’s
Short-Cut methods (80 eighth-grade girls).

Fig. 4. “Teacher scold” scale values computed by
Thurstone’s Case III and Case V and Guilford’s
Short-Cut methods (80 tenth-grade boys).

If statistical theorists can continue to re- 
solve some more of the rational problems of 
the method of paired comparisons, there is 
promise that this highly reliable technique 
will continue to be expanded to an increas- 
ingly larger number of qualitative problems 
in psychological measurement. This promise 
has been demonstrated by practical attempts 
to curtail the amount of labor involved in the
subject’s task, in the ordering of the pairs, 
and in scoring results. McCormick, Bachus, 
and Roberts (19, 20) studied the effects of 
decreasing the number of pairs upon the re- 
liability of the resulting scales. Angoff (2) 
has investigated the problem of removing 
obsolete items from an established paired-comparison
scale and adding new items.
Kephart and Oliver (16) introduced a
punched card procedure as a laborsaving
device for ordering pairs and scoring results. 
This combination of empirical research and 
rational development has fortified the useful- 
ness of a powerful and extremely practical 
scaling technique, the method of paired com- 
parisons. Psychologists interested in research 
with qualitative data will find a valuable aid 
here. 


Summary 


The purpose of this study was to make an 
experimental comparison of Thurstone’s Case 
III and Case V, and Guilford’s Short-Cut ap- 
proaches to scaling paired-comparison data, 
and to review recent rational and empirical 
developments of theoretical and practical sig- 
nificance for the application of paired com- 
parisons to qualitative data. The stimuli 
were ten teacher-approved and ten teacher- 
disapproved behavior categories presented in 
paired-comparison form to four groups of 
school children. Each of the four groups 
contained a sample of 80 subjects and repre- 
sented a particular sex, experimental condi- 
tion (teacher-approved or disapproved behav- 
ior categories), and an age-grade level in the 
range from grades 6—12. 

The intercorrelations between the scale 
values obtained by the three methods in the 
four samples for both sexes under both ex- 
perimental conditions were approximately 
unity; twelve product-moment intercorrela- 
tions were .987 or higher. The results were 
interpreted as corroborative of recent rational 
and empirical investigations demonstrating 
the power of less complicated and economical 
approaches to scaling paired-comparison data 
than Thurstone’s Case III, with the relaxa- 
tion of certain restrictive assumptions. Pos- 
sibilities for broader application of the 
method of paired comparisons to qualitative 
psychological problems were reviewed. 
Reliability and the Number of Rating Scale Categories


A. W. Bendig ¹


University of Pittsburgh 


A recent study (1) has presented evidence 
that the reliability of rating scales is inde- 
pendent of the number of categories on the 
scale. In this study Ss rated themselves on 
their comparative knowledge about 12 foreign 
countries. The scales used varied in the num- 
ber of verbal anchors used to define the scale 
categories (1, 2, or 3) and in the number of 
categories (3, 5, 7, 9, or 11). Both group 
and individual rater reliability was relatively 
invariant over the range from three to nine 
categories with a slight drop in reliability at 
eleven scale points. These results contradict 
the theoretical analysis of Symonds (7) who
concluded that scale reliability should increase
with greater numbers of scale categories,
but that this increase becomes negligible
above nine scale points.

However, the empirical results reported 
were germane only to one type of reliability: 
the reliability with which Ss can distinguish 
between stimuli presented to them. This is 
the type of reliability analysis that is impor- 
tant when raters are given the task of rating 
stimuli on some criterion scale and the mean 
rating for each stimulus is to be used as the 
criterion measure. The assessment of the re- 
liability of pooled supervisor ratings of work- 
ers is a practical example of this type of prob- 
lem. A second question subject to reliability 
analysis is how well a series of self-ratings 
discriminates among the Ss. Many psycho- 
metric instruments can be regarded as a se- 
ries of stimuli presented to the Ss with the 
request that the S rate himself on a two, 
three, or R category scale. For example, the 
Strong Vocational Interest Blank commonly 
requires self-rating on a three-category scale 
while the revised Bogardus Scale of Social 
Distance uses a seven-point scale. Commonly
the total score of an S on these instruments
is the sum or mean of his ratings
and the reliability question concerns the
ability of the test’s total score to discriminate
among the Ss. This type of reliability is the
more usual “test reliability” compared to the
first type which we might call “rater reliability.”
Our first study (1) suggested that
Symonds’ analysis does not hold for rater reliability,
but did not present evidence concerning
test reliability.

¹ Miss Janine Sprague assisted with some of the
statistical computations.

The present report concerns a study of the 
reliability of food preference ratings. Some 
years ago Wallen (9) found that responses to 
a check list of food aversions significantly 
discriminated between groups of normal and 
neurotic military personnel. Because of the 
restricted range of food aversions in his normal
groups Wallen reports no reliabilities for
normal Ss. However, the data given (9, pp. 
79-80) permit the application of Kuder- 
Richardson formula 20 (8, p. 92). Using 
this formula the reliabilities for two normal 
groups are estimated to be .28 (N = 100) 
and .82 (N = 114). The weighted mean
(r-to-z transformation) of these two estimates 
is .64. The original Wallen check list used a 
rating scale with only two categories: strong 
dislike or acceptance of the 20 food stimuli 
presented to the S. Symonds’ conclusions
would suggest that the somewhat low reli- 
ability of this instrument should increase if 
the Ss were allowed to rate the foods on rat- 
ing scales containing more categories that 
would permit the Ss to make finer discrimina- 
tions among the foods. 
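
The weighted mean of the two reliability estimates can be checked with a short sketch of the r-to-z averaging step. Weighting each estimate by N - 3 is an assumption made here, since the weights actually used are not stated; the function name is likewise illustrative.

```python
from math import atanh, tanh


def weighted_mean_r(estimates):
    """Average correlation-type coefficients through Fisher's z.

    estimates is a list of (r, n) pairs. Each z is weighted by n - 3,
    which is an assumed weighting, not one stated in the text.
    """
    total_weight = sum(n - 3 for _, n in estimates)
    mean_z = sum((n - 3) * atanh(r) for r, n in estimates) / total_weight
    return tanh(mean_z)


# The two Kuder-Richardson estimates derived from Wallen's data.
wallen_estimates = [(0.28, 100), (0.82, 114)]
mean_reliability = weighted_mean_r(wallen_estimates)
print(round(mean_reliability, 2))
```

Weighting by N rather than N - 3 gives the same value to two decimal places, so the choice of weights does not affect the figure reported.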


Procedure 


Scales. The stimuli to be rated by the Ss 
were the list of 20 foods used by Wallen (9). 
This list was given to each S with a rating 
scale having either 2, 3, 5, 7, or 9 categories. 
For the two-category scale the instructions 
used by Wallen (9, p. 78) were given. These 
instructions were modified for the other scales 
and three anchoring statements were used to 
describe the center and end categories on 
these scales. The statements used were: (a) 
I like this food very much and eat it often;
(b) I am somewhat neutral toward this food, 
neither liking nor disliking it much; and (c) 
I dislike this food so much that I refuse to 
eat it. 

The anchored scales with unit digits (1, 2, 
etc.) designating the categories were mimeo- 
graphed on single sheets with the list of 20 
foods and randomly distributed to the Ss. 

Subjects. The Ss were 249 students in in- 
troductory and social psychology classes. The 
ratings of Ss were excluded from the analysis 
whenever the S used less than one-half of the 
available categories in rating the foods. Thus 
an S, using the five-category scale, who used 
only ratings of one and five was not included. 
A total of 13 Ss was eliminated under this 
criterion, giving a study group of 236 Ss. 

Analysis. Test reliability was assessed us- 
ing the analysis of variance technique devised 
by Hoyt (4, 5) and recommended by Thorn- 
dike (8, pp. 93-96). Rater reliability was 
estimated by the similar procedures described 
by Ebel (2). Since the number of raters 
varied slightly from scale to scale, the reliability
of a single rater was computed to
adjust for this varying N. Confidence limits
(90 per cent) were computed following Ebel 
(2, pp. 413-414). Finally, the average rank- 
difference correlations between the rankings 
of the foods on the five rating scales were 
computed (6, pp. 80-84). 
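
The two analysis-of-variance reliabilities can be sketched from a single table of ratings. The sketch assumes Hoyt's coefficient in its usual form, one minus the ratio of the residual mean square to the between-subjects mean square, and an Ebel-type single-rater coefficient that sets aside differences between rater means; whether the latter matches the exact variant used here is an assumption. The ratings are hypothetical.

```python
def mean_squares(table):
    """Row and residual mean squares from a two-way table without replication."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_resid = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n_rows - 1)
    ms_resid = ss_resid / ((n_rows - 1) * (n_cols - 1))
    return ms_rows, ms_resid


def pooled_reliability(table):
    """Reliability of row totals (Hoyt's form): 1 - MS_resid / MS_rows."""
    ms_rows, ms_resid = mean_squares(table)
    return 1 - ms_resid / ms_rows


def single_rating_reliability(table):
    """Reliability of one column's ratings (Ebel-type intraclass form)."""
    ms_rows, ms_resid = mean_squares(table)
    k = len(table[0])
    return (ms_rows - ms_resid) / (ms_rows + (k - 1) * ms_resid)


# Hypothetical ratings: 4 subjects (rows) by 5 foods (columns).
ratings = [
    [1, 2, 2, 3, 1],
    [3, 4, 3, 5, 2],
    [2, 2, 3, 4, 1],
    [4, 5, 4, 5, 3],
]
by_food = [list(column) for column in zip(*ratings)]  # foods as rows

test_reliability = pooled_reliability(ratings)           # discriminates subjects
rater_reliability = single_rating_reliability(by_food)   # one rater, across foods
print(round(test_reliability, 3), round(rater_reliability, 3))
```

The same table thus yields both coefficients: with subjects as rows the analysis asks how well summed ratings separate the Ss, and with foods as rows it asks how well a single rater separates the foods.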


Results 


The test reliability and rater reliability esti- 
mates for each rating scale can be found in 
Table 1 along with the 90 per cent confidence 
interval for each reliability. The five test re- 
liability estimates were tested for homogeneity 
using the chi-square method described by Ed- 
wards (3, p. 135). The resulting chi-square 
value was 1.12 which, with four degrees of 
freedom, is not significant at the .05 confi- 
dence point. The mean reliability was .625. 
A similar test of the homogeneity of the rater 
reliabilities gave a chi-square of 1.76 which 
again is not significant. The mean rater re- 
liability was 0.23. 

The average rank-difference correlation between
the rankings of the 20 foods on the five
scales was .90 when corrected for ties. Since
there were a number of foods tied in rank on
the scales with two and three categories, a
similar average rho was computed on the food
rankings on scales with five, seven, and nine
categories and was found to be .91.


Table 1
Reliability Estimates of Food Preference Rating Scales
with Various Numbers of Scale Categories


Number of Rating Scale
Categories


3 5

Number of Subjects 41
Test Reliability
Confidence Limits (.90)

Upper

Lower
Rater Reliability
Confidence Limits (.90)

Upper 3 34 34


Lower 20 1 13 13


Discussion 


The results in terms of the test reliabilities
are fairly unequivocal. No consistent trend
was found in the relation of test reliability 
and number of scale categories. This sug- 
gests that Symonds’ (7) analysis does not 
hold for test reliability. It is interesting to 
note that the mean reliability found for the 
five scales, .625, is very similar to the esti- 
mate from Wallen’s data, .64. While the 
highest reliability, .70, was found with seven 
categories, the two lowest reliabilities, .58 and 
.60, were found with the immediately adjacent 
numbers of categories (five and nine). 

Rater reliability was not as regular as test 
reliability. The invariance of reliability over 
the range of five to nine categories that was 
found in a previous study (1) is here con- 
firmed. However, rater reliability rose at 
three categories and dropped for two cate- 
gories in this study. The drop at two may be 
attributable to the slightly different instruc- 
tions to the Ss with this scale. The slightly 
greater reliability with three categories can- 
not be explained by different instructions, al- 
though it must be pointed out that this re- 
liability is not much higher and, when tested 
statistically, is not significant. Before we
can extend the conclusion of invariant reli- 
ability below five scale categories further in- 
vestigation will be necessary. 

It is interesting to note that the rater reliabilities
found for our list of 20 foods are
somewhat less than those found for ratings of
foreign countries (1, p. 39). This lower rater 
reliability for foods may be a function of the 
type of judgment required of the Ss, of the 
greater number of judgments required of the 
Ss (20 instead of 12), or of a greater homo- 
geneity among the 20 foods than was present 
among the 12 countries. 


Summary 


Ss (N = 236) rated 20 foods as to preference 
using rating scales containing 2, 3, 5, 7, 
and 9 categories. Test reliability (summed 
ratings for each S) and rater reliability 
(summed ratings for each food) were computed 
for each scale. Test reliability was 
constant over the entire range of categories 
and was very similar to reliabilities found in 
another study. Rater reliability was constant 
from five to nine categories, but was 
slightly lower at two and slightly higher at 
three categories. It was concluded that test 
reliability is independent of the number of 
scale categories, and that rater reliability is 
relatively constant, but that further research 
on rater reliability using short scales is 
needed before a similar generalization can be 
made regarding rater reliability. 


The Inference of Accident Liability from the Accident Record 


Alexander Mintz 


City College of New York 


It has been known for a long time that the 
accident liability¹ of people, that is, their 
potential long-range accident rate, and the actual 
number of accidents occurring to them, i.e., 
their accident record, are imperfectly 
correlated. This was already clearly implied in 
the classical 1920 paper by Greenwood and 
Yule (3) on accidents. Newbold (9) pre- 
sented in 1927 a formula for estimating the 
correlation between accident records and acci- 
dent liability. Cobb (1) pointed out in 1940 
that this correlation need not be high. Mintz 
and Blum (8) examined a large number of 
published distributions and found that the 
estimated variance of accident liability usually 
accounts for only a relatively small portion 
of the variance of accident records, thus 
confirming Cobb’s finding. Quite recently, 
Hughes (4) included in his summary of the 
mathematical research on accidents tables and 
graphs implying the imperfect correlation be- 
tween accident liability and records. These 
tables and graphs utilize Greenwood and 
Yule’s theoretical inference that for any one 
particular degree of liability in people there 
should be a Poissonian distribution of acci- 
dent records. His table presents the prob- 
ability, for different degrees of accident lia- 
bility and for different mean group liabilities, 
that a person should have twice as many acci- 
dents as the mean for the total group. 

In all papers mentioned the notion is 
utilized that a given degree of accident liability 
tends to result in a Poisson distribution of 
accident records. This notion has many 
theoretical uses, but its practical usefulness is 
limited by the fact that in the case of particular 
individuals the degree of accident liability is 
generally unknown, so that the Poissonian 
probability distributions of accident records 
cannot be arrived at. The accident record of 
individuals is often available. What is often 
needed is a procedure for estimating the 
unknown accident liability in terms of the 
known accident record. The main problem 
of this paper is: given a known distribution 
of accident records, and a particular accident 
record belonging to this distribution, how 
probable are the different assumed degrees of 
accident liability which may correspond to 
this particular accident record? 


1 “Accident liability” is a more general term than 
accident proneness because it includes both personal 
and environmental conditions predisposing people to 
accidents. Exact constancy of environmental hazards 
is hard to prove, so that it is probably normally 
more accurate to refer to accident liability 
rather than proneness. 

In this general form, the problem has no 
answer. It will be treated here in terms of 
certain assumptions first explored theoreti- 
cally by Greenwood and Yule (3) and em- 
pirically by Greenwood and Woods (2). 
These assumptions were: 

(1) accident liability of people is not 
changed by accidents in which they are involved 
and does not vary with time;³ and 
(2) accident liability varies among people 
and is distributed in some known manner, 
e.g., in accordance with a Pearson Type III 
curve. 

These assumptions have not been definitely 
shown to be true, but they are fairly well sup- 
ported by available evidence, so that a fur- 
ther exploration of their implications is in 
order. 


The Solution 


The following considerations indicate the 
nature of the solution. Accident liability and 
accident record may be treated as two correlated 
variables, the former as the independent, 
the latter as the dependent variable. The 
distribution of accident liability is assumed 
to be known, or to be capable of being estimated 
from the data; so are the theoretically 
Poissonian distributions of accident records 
in the columns of the scatter diagram. Together 
these two types of information define 
a complete correlation surface; this correlation 
surface describes the probability distribution 
of various possible combinations of accident 
liability and accident record. There 
should be no difficulty in determining the 
distributions in the rows of such a correlation 
surface. Such a distribution would indicate 
how probable are various degrees of accident 
liability in the case of a particular accident 
record, presupposing the assumptions of 
accident liability being unaffected by the 
occurrence of accidents and having a known 
distribution. 


2 Or approximately known. There are a number 
of pitfalls in the way of precise characterization of 
accident records which have been discussed in the 
literature. 

3 Actually, a somewhat weaker assumption is 
sufficient, as has been pointed out by Kerrich (6). 


The mathematical derivation of such a 
distribution is presented below. It assumes, as 
was suggested by Greenwood and Yule, that 
accident liability is distributed in a Pearson 
Type III curve. In this particular case the 
answer is a very simple one: If the distribution 
of accident liability for the whole group 
is of the Pearson III type, then the probable 
distributions of accident liability are also of 
the Pearson III type, but with changed 
constants in the formula. The changing of the 
constants results in changed means and standard 
deviations which vary from those of the 
whole group, and also vary according to the 
accident record of the specific subgroups. 


Mathematical Derivation 


Poisson distribution: Probability of $j$ accidents 
for group with accident liability $\lambda$: 

\[ \frac{e^{-\lambda}\lambda^{j}}{j!}, \]

where $e = 2.718\ldots$ (base of natural logarithms). 

Pearson III distribution of accident liability: 

\[ \frac{c^{p}}{\Gamma(p)}\, e^{-c\lambda}\lambda^{p-1}, \]

where $\lambda$ is liability and $c$ and $p$ are constants 
related as follows to the mean ($m$) and variance 
($v$) of the distribution: 

\[ m = \frac{p}{c}, \qquad v = \frac{p}{c^{2}}. \]

Greenwood-Yule derivation of negative binomial 
distribution: Probability of $\lambda$ liability and $j$ 
accidents: product of formulae for Pearson III 
and Poisson distributions: 

\[ \frac{c^{p}}{\Gamma(p)}\, e^{-c\lambda}\lambda^{p-1} \cdot \frac{e^{-\lambda}\lambda^{j}}{j!}
   = \frac{c^{p}}{j!\,\Gamma(p)}\, e^{-(c+1)\lambda}\lambda^{p+j-1}. \]

To determine the probability of $j$ accidents for 
all $\lambda$-s, this expression has to be integrated over 
all values of $\lambda$, so that $0 < \lambda < \infty$: 

\[ \int_{0}^{\infty} \frac{c^{p}}{j!\,\Gamma(p)}\, e^{-(c+1)\lambda}\lambda^{p+j-1}\, d\lambda ; \]

if $(c+1)\lambda = x$, $d\lambda = \dfrac{dx}{c+1}$, this becomes 

\[ \frac{c^{p}}{j!\,\Gamma(p)} \int_{0}^{\infty} \frac{e^{-x}x^{p+j-1}}{(c+1)^{p+j-1}}\, \frac{dx}{c+1}
   = \left(\frac{c}{c+1}\right)^{p} \frac{1}{j!\,\Gamma(p)\,(c+1)^{j}} \int_{0}^{\infty} e^{-x}x^{p+j-1}\, dx \]

(by definition of the $\Gamma$ function) 

\[ = \left(\frac{c}{c+1}\right)^{p} \frac{\Gamma(p+j)}{j!\,\Gamma(p)\,(c+1)^{j}} \]

(general term of negative binomial distribution). 

Derivation of probable distribution of accident 
liability for given accident record: Probability 
of accident liability $\lambda$ and accident record $j$, in 
relation to all possible combinations of $\lambda$-s 
and $j$-s: 

\[ \frac{c^{p}}{j!\,\Gamma(p)}\, e^{-(c+1)\lambda}\lambda^{p+j-1}. \]

Probability of accident liability $\lambda$ and accident 
record $j$, in relation to the combined probability 
of combinations of this particular $j$ with all 
$\lambda$-s: 

\[ \frac{\dfrac{c^{p}}{j!\,\Gamma(p)}\, e^{-(c+1)\lambda}\lambda^{p+j-1}}
        {\left(\dfrac{c}{c+1}\right)^{p} \dfrac{\Gamma(p+j)}{j!\,\Gamma(p)\,(c+1)^{j}}}
   = \frac{(c+1)^{p+j}}{\Gamma(p+j)}\, e^{-(c+1)\lambda}\lambda^{p+j-1}. \]

(Estimated distribution of accident liability $\lambda$ 
corresponding to given accident record $j$. The 
distribution is one of Pearson’s Type III. It 
has an equation of the same form as that given 
above for the Pearson III distribution, but 
with changed constants; $c + 1$ replaces $c$, 
$p + j$ replaces $p$.) 
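
The result of the derivation can be read as a simple rule: a record of j accidents turns the constants (c, p) of the Pearson III curve into (c + 1, p + j). The short Python sketch below illustrates that rule only. The function names and the constants c = 0.5, p = 1.5 are hypothetical and chosen purely for illustration; they are not taken from any of the data sets discussed in the paper.

```python
import math


def pearson3_density(lam, c, p):
    """Pearson Type III density c^p / Gamma(p) * exp(-c*lam) * lam^(p-1)."""
    if lam <= 0:
        return 0.0
    log_density = (p * math.log(c) - math.lgamma(p)
                   - c * lam + (p - 1) * math.log(lam))
    return math.exp(log_density)


def liability_given_record(c, p, j):
    """Constants of the liability curve for people with j accidents:
    c + 1 replaces c, and p + j replaces p."""
    return c + 1.0, p + j


def mean_and_variance(c, p):
    """Mean p/c and variance p/c^2 of a Pearson III curve."""
    return p / c, p / c ** 2


# Hypothetical group constants, for illustration only.
c, p = 0.5, 1.5
for j in (0, 2, 5):
    c_j, p_j = liability_given_record(c, p, j)
    m_j, v_j = mean_and_variance(c_j, p_j)
    print(f"j={j}: c={c_j:.2f}, p={p_j:.2f}, mean={m_j:.3f}, variance={v_j:.3f}")
```

Under these assumed constants the mean of the estimated liability rises with the accident record, as (p + j)/(c + 1), while its variance, (p + j)/(c + 1)^2, shows how much uncertainty about the individual remains.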
The estimated Pearson III curves of accident 
liability for subgroups with given accident 
records may be interpreted in two ways: 
first, as representing the probable numbers of 
people with various levels of accident liability 
in a subgroup with a given accident record; 
and, second, as representing the degrees of 
probability of these various levels of liability 
for people with given accident records. Only 
the second interpretation is appropriate in the 
case of small subgroups. 


Illustrative Results 


The accident distribution reported in Green- 
wood and Woods’ Table 8A was used for the 
computation of Pearson III curves as just ex- 
plained. This set of data was chosen for two 
reasons: (1) it can be closely approximated by 
the theoretical distribution derived from the 
Greenwood-Yule assumptions (the so-called 
negative binomial distribution), which sug- 
gests that these assumptions may hold true in 
this case; (2) this set of data was suggestive 
of a higher correlation between accident rec- 
ord and liability than the other Greenwood 
and Woods sets of data (2). It was thought 
therefore that the demonstration of a rela- 
tively wide spread of probable accident lia- 
bility corresponding to a particular accident 
record would be particularly convincing. Ta- 
ble 1 presents this set of data, together with 
the negative binomial distribution fitted by 
the method of moments. 
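
As a rough illustration of fitting by the method of moments, the sketch below uses two standard properties of the negative binomial term derived earlier: its mean is p/c and its variance is p/c + p/c^2. These relations are not stated in the paper, so treating them as the fitting equations is an assumption, as are the use of the population variance (dividing by N), the function names, and the made-up frequency list.

```python
import math


def fit_by_moments(counts):
    """Estimate (c, p) from counts[j] = number of people with j accidents.

    Assumes mean = p/c and variance = p/c + p/c^2, so that
    c = mean / (variance - mean) and p = mean * c.
    """
    n = sum(counts)
    mean = sum(j * k for j, k in enumerate(counts)) / n
    var = sum(k * (j - mean) ** 2 for j, k in enumerate(counts)) / n
    if var <= mean:
        raise ValueError("variance must exceed the mean for this model")
    c = mean / (var - mean)
    p = mean * c
    return c, p


def negative_binomial_term(j, c, p):
    """(c/(c+1))^p * Gamma(p+j) / (j! * Gamma(p) * (c+1)^j)."""
    log_term = (p * math.log(c / (c + 1.0)) + math.lgamma(p + j)
                - math.lgamma(j + 1) - math.lgamma(p)
                - j * math.log(c + 1.0))
    return math.exp(log_term)


# Made-up frequencies for 0, 1, 2, ... accidents (not the Table 8A data).
observed = [20, 14, 8, 4, 2, 1, 1]
c, p = fit_by_moments(observed)
n_people = sum(observed)
for j, obs in enumerate(observed):
    fitted = n_people * negative_binomial_term(j, c, p)
    print(f"{j} accidents: observed {obs}, fitted {fitted:.1f}")
```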

Figure 1 presents the three Pearson III 
curves for the subgroups with zero, five, and 
eleven accidents (a subgroup of one). 


Table 1 

Accid.    People    Theoretical 
                    (Neg. Binomial) 

 0          8         8.7 
 1         11        10.1 
 2                    8.9 
 3                    6.9 
 4                    5.0 
 5                    3.5 
 6                    2.4 
 7                    1.6 
 8                    1.0 
 9                    0.7 
10                    0.4 
11                    0.3 

Total                49.5 


The curves show that a large group whose 
members have had five accidents apiece is 
likely to include some persons whose poten- 
tial accident records have a very wide range. 
There is actually some noticeable overlapping 
even between the probability curves of lia- 
bility of the two extreme subgroups with zero 
and eleven accidents. 

There is a very considerable amount of 
overlapping between the liability curve for 
the five-accident group and the other two. 

The estimated Pearson III distributions of 
accident liability for people with given acci- 
dent records enable one to estimate the com- 
bined probability of their accident liability 
falling within certain ranges, e.g., the range 
below the mean of the whole group or above 
twice the group mean, or from the first to the 
third quartile of the whole group. One can 
do this by integrating the expression for the 
Pearson III curve, or by using tables of the 
Pearson III integral (e.g. 10). 
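
The integration mentioned above can also be done numerically. The sketch below evaluates the integral of the estimated Pearson III curve (constants c + 1 and p + j) from zero up to a chosen limit, using the usual power series for the incomplete gamma function. The series routine, the function names, and the constants c = 0.5, p = 1.5 are illustrative assumptions and are not the values underlying Table 2.

```python
import math


def regularized_lower_gamma(a, x, tol=1e-12, max_terms=10_000):
    """Integral from 0 to x of t^(a-1) e^(-t) dt, divided by Gamma(a),
    computed from the power series x^a e^(-x) * sum_n x^n / Gamma(a+n+1)."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < tol * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def prob_liability_below(limit, c, p, j):
    """Probability that accident liability is below `limit` for a person
    with j accidents, from the Pearson III curve with c + 1 and p + j."""
    return regularized_lower_gamma(p + j, (c + 1.0) * limit)


# Hypothetical group constants, for illustration only.
c, p = 0.5, 1.5
group_mean = p / c
for j in range(6):
    prob = prob_liability_below(group_mean, c, p, j)
    print(f"{j} accidents: {100 * prob:.1f} per cent below the group mean")
```

Other ranges follow by subtraction; for example, the probability of liability above twice the group mean is one minus `prob_liability_below(2 * group_mean, c, p, j)`.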

Table 2 presents the results of such a pro- 
cedure for two published distributions of ac- 
cidents. These distributions are that in 
Greenwood and Woods’ Table 8A and that 
of 29,531 Connecticut car drivers discussed 
by Cobb (1). The figures represent the probability 
that the true accident liability of a person 
with a given accident record is below the 
mean⁴ of the whole group. The two means 
were 2.8 accidents per person and .24 acci- 
dents per person for the Greenwood-Woods 
set 8A and for the Connecticut drivers, re- 
spectively. 

In the Greenwood-Woods set of data, a 
person who had no accidents has 95.2 chances 
in a hundred of having accident liability below 
the mean of the whole group. For people 
with 1 accident, the probability of accident 
liability below the group mean of 2.8 is 
85.7 per cent, and so on. Similar statements 
can be made about the Connecticut car 
drivers. It should be noted that 95 per cent 
certainty about accident liability being greater 
than that of the group mean is not reached 
until the person has had 8 or more accidents 
in the Greenwood-Woods sample, or 3 or more 
accidents in the sample of Connecticut drivers. 
These accident records are reached by 
only a few persons: 5 out of 50, or 10 per 
cent, in the former case, and 51 out of 29,531, 
or .17 per cent, in the latter case. 


Table 2 
Probability (in Per Cent) of Accident Liability 
Below the Mean of the Whole Group 

Number of    Greenwood-Woods    Connecticut 
Accidents    Table 8A           Drivers 

 0           95.2               76.6 
 1           85.7               40.9 
 2           70.6               15.9 
 3           52.6                4.9 
 4           33.0                1.1 
 5           20.9                0.3 
 6           11.3                0.1 
 7            5.6 
 8            2.5 
 9           23 
10            0.4 
11            0.1 


Fig. 1. Pearson III curves, 
$\frac{(c+1)^{p+j}}{\Gamma(p+j)}\, e^{-(c+1)\lambda}\lambda^{p+j-1}$, 
representing the probability distributions 
of different degrees of accident liability for 
given accident records. Source of data: Greenwood 
and Woods, Table 8A. 


4 In the subsequent discussions, there are references 
to the probability of accident liability being either 
above or below the group mean. This is done for 
the sake of simplicity; the infinitesimal probability 
of accident liability being exactly at the group mean 
is disregarded. 



Discussion 

The immediately preceding statements 
should not be confused with the customary 
statements of the level of statistical significance. 
If one states that a finding is significant 
at the 5 per cent level, one means that, 
if the null hypothesis is assumed to be valid 
for the population, deviations as great or 
greater than the one found are expected to be 
found in only 5 per cent of the samples. The 
statement, “the probability of accident liability 
below the group mean is 5 per cent in 
people having 3 accidents” does not presuppose 
an assumed null hypothesis and is not 
intended to be a test of a null hypothesis. 
On the contrary, it presupposes the existence 
of differences in accident liability and charac- 
terizes the probable percentage of below-av- 
erage liability among people who had 3 acci- 
dents each. 

Taken at their face value, the figures in 
Table 2 exhibit the way in which accident 
liability below the group mean becomes less 
probable and accident liability above the 
mean becomes more probable in the case of 
persons with the larger numbers of accidents. 
Clearly, the accident records have some 
validity as information about accident liability. 
In the Greenwood-Woods set of data, 5 or 
more accidents mean accident liability above 
the group mean in at least four cases out of 
five. In the Connecticut drivers’ sample, 
drivers who had 2 or more accidents can be 
expected to have accident liability above the 
group mean in more than 5 cases out of 6. 
On the other hand, relative certainty that a 
given individual has accident liability above 
the mean of the whole group can only be 
achieved in a very small number of people. 
Whether one chooses to emphasize the fact 
that accident records have some validity as 
long range predictors, or the limitations of 
their validity, is presumably dependent on 
one’s scientific level of aspiration. 

Should the figures be taken at their face value? 
The answer depends on whether the Green- 
wood-Yule unequal liability assumptions are 
to be accepted. The principal evidence on 
their validity seems to be as follows: 

1. The negative binomial distribution which, 
as theoretically derived by Greenwood and 
Yule, was based in part on these assumptions 
fits many obtained accident distributions very 
well. Newbold (9) showed, in effect, that it 
usually fits them better than any other theo- 
retical distribution embodying the ideas of 
unequal accident liability in people and un- 
changed accident liability after accidents. 

2. The negative binomial distribution can 
be derived by the use of different sets of as- 
sumptions and therefore does not differentiate 
between them and the Greenwood-Yule as- 
sumptions. Thus Irwin (5) showed that the 
negative binomial distribution is to be ex- 
pected if there are no initial differences in 
accident liability and that accident liability 
of people increases as a linear function of 
accidents. Lundberg (7) quotes rather simi- 
lar deductions by Polya and Eggenberger. 

3. There are a few available sets of acci- 
dent data for the same people during con- 
secutive periods. In terms of the Greenwood- 
Yule assumptions the accident rate of people 
should remain constant. In terms of the 
assumptions explored by Irwin it should 
increase. According to the evidence presented 
by Irwin and by Kerrich (6), the accident 
rates vary only slightly with time, and tend 
to decrease rather than to increase. 

In terms of the evidence presented, the ma- 
jor inferences from the Greenwood-Yule as- 
sumptions appear to be in accord with avail- 
able data in many cases. However, more re- 
search is needed, particularly in view of the 
scanty available evidence on accidents in suc- 
cessive periods. There are theoretical con- 
siderations making the exact truth of Green- 
wood and Yule’s assumptions of unchanged 
liability after accidents rather unlikely. Nev- 
ertheless, the available evidence strongly sug- 
gests that in the cases in which the negative 
binomial distribution fits the data the 
Greenwood-Yule assumptions may be viewed as 
approximating the truth. The inferences from 
these assumptions pertaining to the probable 
degree of accident liability which may corre- 
spond to given accident records then may be 
tentatively accepted as approximately true in 
many cases. 


Summary 


The classical assumptions of unchanged 
accident liability after the occurrence of an 
accident were provisionally accepted. Certain 
further implications of these assumptions were 
explored. The assumed distributions of accident 
liability in groups of people were broken 
up into probable component distributions of 
liability for subgroups with given accident 
records. These component distributions were 
found to have the same form as the total 
distribution if the latter is of Type III. 
Quantitative examples of applications of this 
finding were given. It was pointed out that 
accident records have some validity as indicators 
of accident liability, but that relative 
certainty about high accident liability of 
particular persons can be achieved in terms of 
their accident records only in a small 
minority of cases. 


Received April 20, 1953. 


It is often a task of the psychologist, in 
investigations of relationships between various 
aspects of the operation or management of 
groups and the effectiveness of the groups, to 
find some way of measuring this effectiveness. 
When group effectiveness is the goal, these 
attempts at measurement can be extremely 
difficult. Many times they fall short of the 
mark and do not really measure what the 
managers of the groups would themselves call 
effectiveness. 

In a study of the relationship of various 
management factors to the relative safety ef- 
fectiveness of Army motor vehicle units, done 
under contract with the Department of the 
Army, the present authors and their col- 
leagues at Richardson, Bellows, Henry and 
Company came to grips with this knotty cri- 
terion problem. 

The primary objective of the study was to 
determine the relationship, if any, between 
the driving safety of motor vehicle units, and 
management and supervisory practices used 
in those units. The orientation of the present 
paper is limited to a summary of the attempts 
to define and measure the criterion variables. 

Although few psychologists have done in- 
tensive work in this area, the history of re- 
search on accident proneness, safe driving, 
and traffic problems is long and varied. Two 
excellent reviews are available: Johnson’s (1) 
and Lawshe’s (2). The authors of these re- 
views show that few investigators have dif- 
ferentiated between driving skill and safe 
driving, and that the early investigators con- 
centrated on “simpler” functions like depth 
perception rather than “higher” functions like 
abilities, attitudes, etc. 

The research presented in this paper has as 
its orientation the safe operation of a motor 
vehicle unit, rather than the more usual 
orientation of skill in driving. Thus, the first 
question to which efforts were directed was: 
How could the units which were “high” and 
those which were “low” in safety of operation 
be properly identified so that another 
observer using the same procedures would 
arrive at the same identification? 


1 The opinions and conclusions expressed in this 
article are those of the authors; they do not 
necessarily reflect official Department of the Army 
policy or the views of anyone other than the authors. 

Preliminary investigation of accident rates 
of motor vehicle units led to the conclusion 
that, for the purposes of this study, such data 
were inadequate. Differences in definition of 
a reportable accident, in accuracy of mileage 
estimates, in mission, in equipment, traffic 
conditions encountered, etc., were factors 
involved. Added to this was the statistical 
unreliability of reported accident rates for, say, 
50-vehicle motor units. 

Such information and evidence led toward 
ratings of safety of performance as a method 
of identifying units for further study. In 
addition, it was not feasible to restrict the study 
to motor vehicle units which were fairly com- 
parable in organization, equipment, mission, 
and conditions of operation. To produce 
really useful results, the study had to en- 
compass motor units as they occurred rather 
than as one might like to have them set up 
for a “tight” experimental design. 


Procedure 


The forms used were constructed on the 
basis of the preliminary surveys and the field 
tryout. The criterion procedure was as fol- 
lows: 


Relative ratings of over-all safety of operations 
of all motor vehicle units in an installation were 
asked for. Criterion rating sessions were held, 
attended by post or divisional staff officers, 
Provost Marshals, Safety Officers and Directors, 
and other persons who would have an acquaint- 
ance with the comparative performances of the 
motor vehicle units at the installation. Motor 
officers and the sergeants of the individual motor 
units did not attend these sessions but were asked 
to fill out the rating forms on their own units at 
the time of intensive study (not reported here) 
of their own units. 

In the criterion rating sessions three forms 
were filled out by the participants after pertinent 
instructions. Since these meetings were informal 
gatherings, it was possible for the RBH field rep- 
resentative to monitor the rating procedures of 
the participants and thus ensure that the instruc- 
tions were being followed. Discussion of units 
being rated was not permitted; but otherwise 
conversation flowed freely during these sessions. 
The forms used were as follows: 

The Familiarity Rating Form (CRT-29, RBH 
Form R212-J) was constructed for the raters to 
indicate how well they knew each of the motor 
vehicle units at a particular installation. This 
form was designed to overcome the objections 
frequently made by those officers that they could 
not judge the over-all safety of a motor unit be- 
cause they were not sufficiently familiar with it. 
The form was used to identify the motor vehicle 
units with which they were best acquainted. 

While the use of the Familiarity Rating Form 
did not completely overcome their reluctance, it 
seemed to relieve some tension in the criterion 
rating sessions. The form merely listed the mo- 
tor units in the post or division, and the officers 
were instructed to rate their familiarity with the 
units as follows: 


“0” if unfamiliar 

“1” if slightly familiar 

“2” if familiar with the unit’s personnel, 
driving, and other factors to be rated. 


The Safety Factors Rating (CRT-26, RBH 
Form R212-G) was then given to the group 
members who were asked to rate, on 16 aspects 
of over-all safety, the six units with which they 
were most familiar. The raters who did not 
know six units well enough to rate them rated 
only those with which they had indicated fa- 
miliarity. 

The Criterion Ranking Form (CRT-27, RBH 
Form R212-H) was then administered. On this 
form the raters identified in order, from those 
with which they were familiar, up to six units 
which they thought were “best” from a viewpoint 
of all-around safety, and up to six units which 
were “worst” in all-around safety. 

From the analysis of these forms it was pos- 
sible to select a number of “high” safety and 
“low” safety units from each post or division. 
In many cases, upon further acquaintance with 
the unit (e.g., the Xth Ordnance Battalion), it 
was found that the unit which was selected as 
high or low really contained more than one motor 
unit (e.g., Companies A, B, C, and Hq.). In 
such cases, the Battalion, Regimental, or Group 
staff went through the criterion procedures as 
outlined above for the motor units under their 
cognizance. Company or Battery level motor 
units were selected from these ratings, with this 
limitation: only the units rated lowest were se- 
lected from Battalions, Regiments, or Groups 
previously rated low, and only the highest were 
selected from units previously rated high. 


A Check on the Criterion Groups 


In spite of the experience gained early in 
the study at various Army installations, which 
showed that the usual accident and mileage 
records were unsuitable for our purposes, an 
attempt was made to provide an objective 
criterion measure of this general type. 


It had originally been planned to collect acci- 
dent frequency statistics for the units. Review 
of unit safety records during the pilot field study 
had indicated that accident frequency statistics 
were inadequate to permit differentiation among 
relatively safe and unsafe motor units. The re- 
sults of this trial study, however, showed that 
many incidents, which could be construed as re- 
lated to the safe operation of motor units, were 
not being reported as accidents. To utilize this 
information, the Vehicle Damage Report (CRT-30, 
RBH Form R212-K)² was devised. It was 
hoped that this form would provide a higher de- 
gree of objectivity and serve to substantiate or 
refute the selection of the units on the basis of 
the ratings and rankings. Data on the fre- 
quency of damages occurring within individual 
units, as an empirical measure of their safety of 
operation, were collected. The report was essen- 
tially a list of approximately 50 damages which 
could occur to a vehicle as a result of an impact. 
These were compiled and divided into nine gen- 
eral areas (e.g., Bumper Assembly, Body-Front, 
Body-Sides, Wheel Assembly, etc.). The list 
was further subdivided into types of vehicles. 

The list was administered as a group interview 
with the units’ motor sergeants, motor officers, 
mechanics, etc. Copies of the list were handed 
to each member of a group so that the list 
could serve as a stimulus to recognition and re- 
call of damages which had occurred to the unit’s 
vehicles during the preceding calendar month. 
They were encouraged to use whatever records 
they had available, and to consider each vehicle 
separately, one at a time. 


Analysis and Final Criterion Groups 


From two division and four post headquar- 
ters, 93 motor units were rated by varying 
numbers of raters. From these 93, 16 “high” 
units and 16 “low” units were chosen. The 
analysis of these criterion measures and how 
they were used in making the choices are 
given below. 

The scoring for the Safety Factors Rating 
was based on results of the preliminary field 
study at different installations in which an 
empirical key was derived and shown to be 
related to the ranking of motor units (r = 
.46). Using all of the 93 rated units from 
the present sample, this empirical scoring was 
shown to be related to the rankings, r = .83. 
This high relationship may reflect considerable 
“halo,” but as far as one can rely on the 
validity of the ratings in the criterion rating 
sessions, this consistency serves to substantiate 
the identification of criterion units which 
are definitely low or high. Whenever there 
was obvious disagreement between two or 
more equally qualified raters as to whether 
a unit was high or low, the unit was not 
included in the final criterion groups. Whenever 
there was inconsistency between a unit’s 
average score on the ratings and on the 
rankings, the unit was not included in the final 
groups. 


2 The authors are indebted to Mr. Warren R. Graham 
of Richardson, Bellows, Henry and Company 
for the development of the Vehicle Damage Report. 

The field utilization of the Safety Factors 
Rating and the Criterion Ranking forms, by 
which the selection was made, had certain 
shortcomings. It was impossible to have each 
rater rate all the units—which would have 
been the most desirable procedure—because 
no raters were sufficiently familiar with the 
units to do so. It was a fortunate but infre- 
quent instance when three raters could rate 
the same unit. For this reason, many units 
which, on the basis of rating and ranking 
scores, would appear to be definitely “low” 
or “high” were rated by only one person, and 
so had to be discarded in favor of other units 
where two or more raters had agreed on the 
unit’s relative position in the installation. 

The research team discovered early in the 
field work that some criterion raters consid- 
ered themselves to be more or less qualified 
than other raters. Therefore, the qualifica- 
tions of the raters, as they were informally 
expressed during the criterion sessions, were 
also taken into account in selecting units, 
that is, when there was a question about wide 
deviations in the scores of the units. A further 
consideration in the selection of units 
for intensive study of administrative prac- 
tices (not reported in this paper) was their 
representativeness in terms of number and 
kinds of vehicles, missions of the units, and 
special functions or hazards. 


Table 1 
Means and Standard Deviations of Scores for the 
High, Low, and Total Groups for Criterion 
Rankings and Safety Factors Ratings 

                      Rankings*       Ratings 
Groups                M      σ        M      σ 

High (16 units)       37.9 
Low (16 units)        23.1 
Total (93 units)      30.5 

*Each unit’s rankings (by one or more persons) 
converted to a standard score scale that has a mean of 
30 and a σ of 10. 


The means and standard deviations of the 
scores of the finally selected criterion groups 
are shown in Table 1. 

The mean rating and ranking scores of the 
93 units and related data are given in Ta- 
ble 2. 
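
The note to Table 1 mentions a standard score scale with a mean of 30 and a σ of 10. The paper does not describe how the conversion was computed, so the Python sketch below is only one plausible reading: a linear rescaling of raw scores through z-scores. The function name, the use of the population standard deviation, and the raw values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def to_standard_scores(raw_scores, target_mean=30.0, target_sd=10.0):
    """Linearly rescale raw scores so that the rescaled set has the
    requested mean and (population) standard deviation."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    if sd == 0:
        raise ValueError("raw scores have no spread; cannot standardize")
    return [target_mean + target_sd * (x - m) / sd for x in raw_scores]


# Made-up raw ranking scores for six hypothetical units.
raw = [2.0, 5.0, 3.5, 8.0, 1.0, 6.5]
scaled = to_standard_scores(raw)
for r, s in zip(raw, scaled):
    print(f"raw {r:4.1f} -> standard score {s:5.1f}")
```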

In the selection of units, where a choice 
was possible, varied units were selected. This 
was done so as to include in the sample as 
many differently structured units with differ- 
ent missions as possible. Over-all, however, 
the selected high units and low units were 
similar. 

For the high group, 9 were post units and 
7 divisional units. For the low group, 10 
were post units and 6 divisional units. The 
breakdown (Table 3) shows the make-up in 
more detail. 

It seems to be established, therefore, that 
the selected high and low groups, while quite 
similar in types and numbers of vehicles, 
numbers of drivers, sizes of units, and mis- 
sions, were in reality different, presumably 
in terms of performance. 

Since the Safety Factors Rating was used 
by various personnel at the criterion sessions, 
it was desirable to see if there were any significant
differences in the way in which various
groups rated. In other words, were the
ratings as a whole homogeneous? This ques- 
tion resolved itself into the testing of the 
hypothesis that the ratings of four groups of 
raters were random selections from the same 
universe. The four groups in question were: 
Group A. Provost Marshals, Safety Officers, 
and Directors; Group B. Post Ordnance, 
Maintenance, Transportation and Motor Of- 
Table 2 


Mean Ratings of Units by Higher Echelon Criterion Raters 


Mean 


Unit Rating Rank Selected 


Rating Rank** Selected 


—2 33 48 7 22 
—1 27 49 11 40 
—{ 18 50 2 21 
13 38 51 11 36 
7 30 52 10 33 
11 31 53 10 43 
10 39 54 25 
24 55 24 
40 56 24 
33 57 28 
25 58 
25 59 

40 
43 61 

35 

20 

25 

30 

35 

25 

38 




25 
18 
39 
30 
41 
32 


22 


25 
31 


34 


24 
30 




34 
19 
29 
24 
36 
29 
27 
25 
33 
39 


41 
42 
43 
44 
45 
46 
47 







* Only one rater rated the unit. 
** Converted to standard scores as in Table 1. 
Table 3 


Kinds of Units in the Sample 


Kind of Motor Unit High Group Low Group 


Car Company 2 
Truck Company 1 
Ordnance Company 3 
MP Units 3 
Engineer Unit 1 
QM Company 1 
Signal Company 1 
FA Battalion 1 
HQ Company 1 
Administrative Motor Pool 2 
Ambulance Company 

Heavy Tank Battalion 

Antiaircraft Battalion 


ficers; Group C. Staff Officers and others not 
classed in Group A, B, or D; and Group D. 
Unit Motor Officers and NCO’s. 

Because of the possibility that there might 
be a difference in the homogeneity of ratings 
among the groups when rating high units as 
opposed to when rating low units, it was de- 
cided to analyze the ratings of high units 
separately from the ratings of low units. 
Analysis of variance was employed to test 
the above hypothesis. Of 16 items tested, 
F ratios for low units were significant at 
P < .01 for ten items; for high units, only 
two items were rated differently by the 4 
groups. 
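
The F ratio used above to test whether the four rater groups come from the same universe can be sketched as a one-way analysis of variance on a single item. The ratings below are hypothetical, and the function is an illustration of the general technique rather than the authors' own computation.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: ratings around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical ratings of one item by rater Groups A-D (not data from the paper).
ratings = {
    "A": [3, 4, 3, 2],
    "B": [4, 4, 5, 3],
    "C": [3, 3, 4, 4],
    "D": [5, 5, 4, 5],
}
f_ratio, df_b, df_w = one_way_f(list(ratings.values()))
```

The resulting F is compared with the tabled value for the two degrees of freedom at the chosen level (P < .01 in the text); repeating this for each of the 16 items gives the counts of items rated differently by the groups.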

A second item analysis of the Safety Fac- 
tors Rating was made in which the responses 
of the motor officers and motor sergeants 
(Group D) were compared with all of the 
criterion raters’ responses together (Groups 
A, B, and C). The results of this analysis, 
using Strong’s method (3) to obtain response 
weights, and testing for significance with chi 
square, showed that a significant difference 
existed between the higher echelon officers 
and the motor unit leaders on each of the 16 
items. 

It seems, therefore, that the motor officers 
and sergeants do disagree with the criterion 
raters in rating their own units. Linked with 
the results of the item analysis of variance, 
this means that motor officers and sergeants 
rate their units differently than do the higher 
echelon officers (criterion raters). This was 
noticed on inspection of the response frequencies 
of the two groups: the criterion ratings 
are consistently lower than the ratings 
by the motor unit’s own personnel. It is 
likely that this difference is due to a typical 
overrating of one’s own organization which 
might be expected from the motor officers 
and motor sergeants. They are unable to 
place realistically their own unit in the con- 
text of units at the installation. This differ- 
ence in groups, however, is much more evi- 
dent in the ratings of low units than in the 
ratings of high units, since the analysis of 
variance showed 10 of the 16 items to be 
rated differently by the groups in rating low 
units, opposed to only two items when rating 
high units. 

Leaders of low units, both officers and 
NCO’s, are less able than high unit leaders 
to place their unit realistically relative to the 
other motor units on the post with regard to 
the over-all safety of the unit. 

In order to determine which of the factors 
rated on the Safety Factors Rating seemed to 
be differentiating between high and low units, 
a further item analysis was made, in which 
the responses of the motor officers and motor 
sergeants from the high units were compared 
with those from the low units. For this com- 
parison to be made, it was necessary to com- 
bine the ratings of NCO’s with those of the 
motor officers. This was possible since the 
functions of the two groups were closely in- 
tertwined in the administration of motor 
pools. The item analysis comparing the re- 
sponses of all motor sergeants with all motor 
officers (rating their own units) showed that 
statistically there was no reason to suppose 
that the former group had rated their units 
differently than the latter group. 

The item analysis of motor officers’ and 
motor sergeants’ responses comparing high 
and low groups showed no marked differences 
in answering the questions. Essentially, this 
means that the motor officers and motor ser- 
geants rate their own units the same regard- 
less of their unit’s relative position in over-all 
safety as rated by the higher echelon officers. 
This corroborates the analysis of variance re- 
sults reported above, in which it was seen that 
the unit leaders consistently rate their own 
units high on the Safety Factors Rating, regardless 
of where the criterion raters place 
the unit. 

The Vehicle Damage Report was analyzed 
to determine if the numbers and kinds of 
damages, as recalled and reported in an interview 
situation by the units’ mechanics, 
would show differences between high and low 
units. 

The numbers of damages incurred by the 
high and low groups were summed separately 
by types of vehicles. The numbers of damages 
by type for each criterion group were 
equated to the number of vehicles of that 
type operated and maintained. In addition, 
the numbers of damages were adjusted to the 
number of trips made in one month. The results 
are summarized in Tables 4 and 5. 


Table 4 


Percentage of Vehicles Damaged per Vehicle Operated in Period 


High Units 

Vehicle    Number      Number     Per Cent    Units Reporting    Units Operating 
Type       Operators   Damages    Damages     Damages            Vehicles 

¼ ton        419         26         6.2       [illeg.]             14 
¾ ton         61          7        11.5       [illeg.]             13 
1½ ton        73          0         0         [illeg.]              7 
2½ ton       297         15       [illeg.]    [illeg.]             15 
Sedans       101         14       [illeg.]    [illeg.]              4 
Misc.         78          9       [illeg.]    [illeg.]             12 

Total      1,029       [illeg.]   [illeg.]    [illeg.]             16 


Low Units 

Vehicle    Number      Number     Per Cent    Units Reporting    Units Operating 
Type       Operators   Damages    Damages     Damages            Vehicles 

¼ ton        337         25         7.4         10                 15 
¾ ton         71          6         8.5          3                 13 
1½ ton        70         20        28.6          1                  7 
2½ ton       312       [illeg.]     6.1       [illeg.]             16 
Sedans        50       [illeg.]    16.0       [illeg.]              3 
Misc.        128       [illeg.]    21.9       [illeg.]             13 

Total        968       [illeg.]    11.0       [illeg.]             16 


Table 4 shows a significant difference between 
the percentages of 1½ ton vehicles 
damaged in the two groups: none in the high 
group, as compared to 28.6% in the low criterion 
group. It should be noted, however, that 
only one unit (which reported 20 accidents 
to vehicles of this type) caused this significant 
difference. The difference is seen also 
when relative amount of use is considered by 
adjusting damages to the number of trips 
made during the period by 1½ ton vehicles 
(Table 5). 

A second significant difference occurs between 
the percentages of vehicles damaged in 
the miscellaneous category (heavy engineering 
equipment, ambulances, wreckers, trailers, 
etc.). When relative use is considered, 
however, the significance of this difference 
disappears, since the low groups use these vehicles 
more frequently than do the high 
groups. 


Table 5 


Percentage of Vehicles Damaged per Trip Made during the Preceding Month 


High Units 

Vehicle    Number    Number     Per Cent    Units Reporting    Units Operating 
Type       Trips     Damages    Damages     Damages            Vehicles 

¼ ton       6,007      26         .4           6               [illeg.] 
¾ ton         903       7         .8           2               [illeg.] 
1½ ton        836       0         0            0               [illeg.] 
2½ ton      3,916      15         .4           7               [illeg.] 
Sedans      1,382      14       [illeg.]       3               [illeg.] 
Misc.         893    [illeg.]   [illeg.]       3               [illeg.] 

Total      13,937      71       [illeg.]      11               [illeg.] 


Low Units 

Vehicle    Number    Number     Per Cent    Units Reporting    Units Operating 
Type       Trips     Damages    Damages     Damages            Vehicles 

¼ ton       3,579      25       [illeg.]      10                 15 
¾ ton         572       6       [illeg.]    [illeg.]           [illeg.] 
1½ ton        980      20       [illeg.]    [illeg.]           [illeg.] 
2½ ton      2,536    [illeg.]   [illeg.]    [illeg.]           [illeg.] 
Sedans        886    [illeg.]   [illeg.]    [illeg.]           [illeg.] 
Misc.       2,252    [illeg.]   [illeg.]    [illeg.]           [illeg.] 

Total      10,805    [illeg.]   [illeg.]    [illeg.]           [illeg.] 


When the difference between percentages 
of all types of vehicles operated during the 
period is considered (Table 4), it is found to 
be significant. Moreover, this difference re- 
mains significant when the relative frequency 
of use is considered (Table 5). 
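
The paper does not name the significance test applied to these percentages. One common choice, shown here only as an assumed illustration, is the two-proportion z test with a pooled standard error. The counts are the ¼ ton and 1½ ton entries of Table 4 (the 20 damages for the low group's 1½ ton vehicles are the accidents mentioned in the text), and each reported damage is treated as one damaged vehicle.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z ratio for the difference between two independent proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se


# High group first, low group second: (damages, vehicles operated).
z_quarter_ton = two_proportion_z(26, 419, 25, 337)   # 6.2% vs. 7.4%
z_one_half_ton = two_proportion_z(0, 73, 20, 70)     # none vs. 28.6%
```

Under this assumed test the ¼ ton difference falls well short of 1.96, while the 1½ ton difference exceeds 2.58, in line with the differences the text singles out.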

These results are interpreted to mean that 
the low criterion units are relatively unsafe as 
compared to the high units. The ability of 
the high echelon officers to make criterion 
rankings and ratings in terms of safety of 
unit operation is substantiated and it is con- 
cluded that the subjective criterion has real 
validity. 


Summary 


The development of criterion measures of 
safety of operation for groups reported in this 
paper proceeded from a consideration of 
previous measures reported in the literature, 


to utilization of rating and ranking pro- 
cedures to obtain preliminary criterion groups 
of motor vehicle units. The criterion was 
not accepted as valid, however, until an in- 
vestigation of damages showed a relationship 
to the preliminary grouping of units. It is 


the authors’ opinion that criteria derived 
from ratings or rankings should be verified 
by showing them to be related to some criti- 
cal behavioral aspects of effectiveness, ac- 
ceptable to the psychologist, to the raters, 
and to the groups being studied. 


Received April 9, 1953. 
Recently several instruments involving optical 
simulation of distance have been developed 
for large scale acuity testing. Among 
such devices are the Bausch and Lomb Ortho- 
Rater, the Keystone Telebinocular, and the 
American Optical Company Sight-Screener. 
These instruments provide means of present- 
ing tests of right eye, left eye, and binocular 
acuity, as well as vertical and horizontal 
phoria, stereopsis, and color vision. Both 
near and far simulated distances may be used. 

For the measurement of far visual acuity, 
optical instruments have several advantages 
over the usual method of wall chart or alley 
testing. The light source is “built in” and, 
therefore, can be made relatively accurate. 
Alley charts, on the contrary, vary widely in 
conditions of illumination. The viewing dis- 
tance of instruments is achieved optically, 
with consequent economy of testing space. 
Targets may be conveniently changed with- 
out crossing the testing room. And of course, 
a variety of visual functions may be tested 
on the same instrument. 

Before any one of these new instruments 
can be considered seriously for extensive 
visual testing, it should be compared with 
wall chart presentation. This study should 
be made on the basis of relative difficulty, re- 
liability, and similarity of functions meas- 
ured. The present paper deals with these 
problems; a comparison is made between 
acuity scores on wall charts and on the 
Bausch and Lomb instrument test. 


Review of Literature 


The reliability of wall chart tests of far visual 
acuity has been determined (2). Data are also 
available on the reliability of instrument tests 
(1, 3). A rigorous comparison between these re- 
liabilities cannot be made because of differences 
in the populations, test targets, and light levels 
employed. A study by Sulzman, Cook and 


*Any opinions expressed herein are those of the 
authors and do not necessarily reflect those of the 
Department of the Army. 
Bartlett (6) did employ the same subjects in 
comparing the reliabilities of instrument and of 
wall chart tests. The instruments they employed 
included the Sight-Screener, the Ortho-Rater, 
and the Telebinocular. It was found that the 
reliabilities of the letter wall chart tests were 
about the same as those of the instrument tests. 
They ranged from .80 to .88 for the two wall 
chart tests, and between .81 and .85 for the 
three instrument tests. In near visual acuity 
testing, reliabilities were also similar. The wall 
charts, however, seemed to be testing a visual 
function somewhat different from that of the instrument 
tests. The correlation between the letter 
wall chart tests was considerably higher than 
that between wall and instrument tests. If those 
correlations had been corrected for attenuation, 
the difference would be even larger. The authors 
conclude that these results may be due to the in- 
troduction of some new factor related to the 
optical system of the instrument or to the fact 
that different targets are used in the various tests. 

Altman and Rowland (3) determined the re- 
lationship between scores obtained on an Ortho- 
Rater and a wall chart when the same target was 
used. The wall chart was an accurate enlargement 
of the plate reproduced for presentation 
at 20 feet. One hundred and fifty-seven eyes 
were tested without refractive corrections in 
order to secure a wide range of acuity scores. 
A correlation of .94 was obtained between acuity 
scores on the Ortho-Rater and wall chart tests. 
This study presents supporting evidence of the 
identity of the visual abilities measured by the 
two methods. 

In the present experiment, an attempt was 
made to compare the test-retest reliabilities and 
to obtain a measure of the correspondence be- 
tween scores on Ortho-Rater and wall chart 
tests. The same subjects, targets, and light 
level were employed in both methods of presentation. 
The conditions of luminance and contrast 
between object and background were equalized 
as closely as possible. With control of these 
conditions, more definitive conclusions may per- 
haps be reached concerning the reliabilities of the 
two presentation methods, and the presence or 
absence of the “apparatus accommodation” fac- 
tor thought by some to affect machine scores (5). 


Method and Procedure 
The present experiment was conducted at 
the Personnel Research Branch’s Pentagon 
Laboratory in Washington, D. C. 





Visual Acuity Measurements 


Fig. 1. Visual acuity items: Army Snellen and modified Landolt ring. 


The subjects were 117 soldiers from Fort Myer, 
Virginia. Soldiers varied in age between 19 and 
37 years, with the mean age at 22.4 years, and 
a standard deviation of 2.6 years. The test tar- 
gets were observed binocularly. All subjects who 
customarily wore corrective lenses used them in 
the experiment. 

Twenty examinees who either reported having 
trouble with their eyes such as irritation, water- 
ing, or fatigue, or who reported having driven 
the night before, were considered as a special 
group. The decision was made to include the 
results of this group in the analysis after no sig- 
nificant differences in estimates of reliability were 
found between this group and the soldiers reporting 
no eye trouble. 

The test designs included a letter chart and a 
modified Landolt ring chart. Samples of the 
items in these targets are shown in Figure 1. 

The letter chart was a modification of the 
Snellen chart employed by the Army in routine 
visual acuity examinations. Items were added 
to give more adequate discrimination where too 
few items and sizes were found on the old test. 
The chart consisted of 12 lines of letters ranging 
in size from 20/100 to 20/7.1 Snellen. The 
modified Landolt ring chart presented a square 
target rather than the circular design used in the 
original ring. The chart contained 11 lines of 
items, ranging in size from 20/135.2 to 20/5.9 
Snellen. 

The Ortho-Rater plates were made from the 
wall charts by a double reduction photographic 
process. It was intended to reduce the wall 
charts (constructed for testing at 20 feet) to 
.0555 of the original size. Actually the reduction 
ratios, as determined by an optical comparator 
with microscopic attachment, are: letter 
plate, right eye .0546, left eye .0545; Landolt 
plate, right eye .0552, left eye .0549. The visual 
angles corresponding to the reduction ratios of 
the Ortho-Rater letter targets are slightly smaller 
than those of the counterpart wall chart; the 
visual angles of the Ortho-Rater Landolt targets 
are almost identical to those of their wall chart. 


1 The authors wish to acknowledge their indebtedness 
to Mr. Owen Conger, Typo and Design Unit, 
Army Publication Service Branch, TAGO, for his 
careful drafting of these vision targets. 

* This reduction ratio is employed by the Bausch 
and Lomb Company in the manufacture of Ortho-Rater 
plates. It is based on an estimated distance 
of 40 mm. from lens surface to the eye, and 362 mm. 
from the far plate to the eye. 
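
The size of the discrepancy between intended and measured reduction ratios can be worked out directly. The sketch below assumes that, at a fixed optical distance, visual angle is proportional to target size (the small-angle approximation) and that the intended ratio of .0555 would reproduce the wall chart angles exactly. The function names are illustrative only.

```python
NOMINAL_RATIO = 0.0555  # intended reduction of the 20-foot wall charts

# Measured reduction ratios reported for the four plates.
measured_ratios = {
    "letter, right eye": 0.0546,
    "letter, left eye": 0.0545,
    "Landolt, right eye": 0.0552,
    "Landolt, left eye": 0.0549,
}


def relative_size(measured_ratio, nominal_ratio=NOMINAL_RATIO):
    """Plate target size (and, approximately, visual angle) relative to the wall chart."""
    return measured_ratio / nominal_ratio


def effective_snellen_denominator(denominator, measured_ratio, nominal_ratio=NOMINAL_RATIO):
    """Snellen denominator that a slightly over-reduced plate item actually represents."""
    return denominator * measured_ratio / nominal_ratio


shrinkage_pct = {name: 100 * (1 - relative_size(r)) for name, r in measured_ratios.items()}
letter_20_20 = effective_snellen_denominator(20, measured_ratios["letter, right eye"])
```

Under these assumptions the letter plates come out about 1.6 to 1.8 per cent smaller than intended, so a nominal 20/20 letter item behaves like one of roughly 20/19.7, while the Landolt plates are within about 1 per cent of the intended size.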

The laboratory in which testing took place 
was constructed in conformity to specifications 
formulated by the Armed Forces—NRC Vision 
Committee. The viewing distance was 20 feet 
for wall chart testing. Illumination was fur- 
nished by three overhead lights in flashed opal 
glass fixtures. These fixtures were evenly spaced 
along the testing alley. The front of the alley, 
sides, top, and floor were covered by white osna- 
burg cloth which served to provide an evenly lit 
surround over the visual field. 

The brightness of the wall charts and Ortho- 
Rater plates was 13.5 millilamberts. A MacBeth 
Illuminometer was employed in making light 
measurements. In calibrating the brightness of 
the Ortho-Rater plates, observations were made 
against a blank plate with the eyepiece of the 
instrument removed. A correction was added to 
adjust for loss of light to be expected in trans- 
mission through the eyepiece. The required 
Ortho-Rater and wall chart brightnesses were 
secured before each session, by use of a volt- 
meter and a continuously variable resistance 
(variac). 

Before being tested, each subject was shown 
sample targets of the designs to be used. The 
testing procedure was carefully explained. It 
was emphasized that he was to keep reading each 
test until told to stop. The subject was encouraged 
to guess if he was not sure. 

The examiner observed the subject at all times 
to make sure that he did not squint or view the 
charts obliquely. The subject was rested from 
time to time. Responses were transmitted elec- 
trically to an adjacent room where they were 
checked by a technician and recorded on prepared 
answer forms. 

The following presentation order of tests was 
maintained: wall chart letter, wall chart modi- 
fied Landolt, Ortho-Rater letter, Ortho-Rater 
modified Landolt. These tests were a portion of 
a larger group of 17 mesopic and photopic targets 
given in the same session. Subjects had ob- 
served five mesopic wall charts and two mesopic 
Ortho-Rater plates before taking the four tests 
discussed here. The letter wall chart was the 
third test given on the photopic level, the modi- 
fied Landolt wall chart was the fifth, the Ortho- 
Rater letter plate was the eighth, and the Ortho- 
Rater modified Landolt plate was the ninth of 
the ten tests given at the photopic level. The 
same procedure was followed in the retest ses- 
sion two weeks later. 


Results 


An indication of the relative difficulty of 
wall chart and Ortho-Rater presentation is 
shown in Table 1. The mean represents the 
average number of items achieved by the sub- 
jects before the criterion of failure was met. 
These results are presented for four scoring 
methods: 

(a) Number of rights before two consecutive 
miscallings were first made; (b) Number 
of items attempted before two consecutive 
miscallings were first made; (c) Number of 
rights before three consecutive miscallings 
were first made; and (d) Number of items 
attempted before three consecutive miscallings 
were first made. These were utilized to 
study the effect of scoring method on results 
and, thus, give the results wider generality. 
It will be recognized that these scoring methods 
are non-independent measures. 
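
The four scoring methods can be sketched as one function with two switches. This is an interpretation, not the authors' scoring key: it assumes that "before" excludes the items in the failing run of consecutive miscallings, and that a subject who never meets the failure criterion is scored on every item presented. The response record is hypothetical.

```python
def score(responses, misses_to_stop, count="rights"):
    """Score a list of responses (True = item called correctly) in presentation order.

    misses_to_stop: number of consecutive miscallings that ends the test (2 or 3).
    count: "rights" for items called correctly, "attempted" for items attempted.
    """
    rights = 0
    attempted = 0
    run = 0  # current run of consecutive miscallings
    for correct in responses:
        attempted += 1
        if correct:
            rights += 1
            run = 0
        else:
            run += 1
            if run == misses_to_stop:
                attempted -= misses_to_stop  # assumption: the failing run is not counted
                break
    return rights if count == "rights" else attempted


METHODS = {
    "A": (2, "rights"),
    "B": (2, "attempted"),
    "C": (3, "rights"),
    "D": (3, "attempted"),
}

# Hypothetical record for one subject.
record = [True, True, False, True, False, False, True, False, False, False]
scores = {name: score(record, k, what) for name, (k, what) in METHODS.items()}
```

Because all four scores come from the same response record, they are necessarily correlated, which is the non-independence noted above.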

It may be seen that subjects were able to 
read further on the wall chart letter tests 
than on the Ortho-Rater letter plates before 
meeting the criterion of failure. This differ- 
ence in difficulty may perhaps be explained 
by the somewhat larger visual angle of the 
letter wall charts (see Method and Pro- 
cedure). The scores on the Landolt tests, 
where more perfect reproduction of visual 
angle was achieved, are about equal for the 
two methods of presentation. The standard 
deviations are approximately the same, ex- 
cept that the Ortho-Rater Landolt retest 
shows greater variability than its wall chart. 
Although there are several significant differences 
in means and standard deviations of 
the two methods of presentation, the differences 
are too small to be of practical importance. 
In Snellen acuity units, negligible 
changes in scores are implied. As shown in 
Figures 2 and 3, the test distributions are 
very similar. In general, the evidence does 
not indicate that Ortho-Rater and wall chart 
presentations differ greatly in difficulty and 
variability. 


Table 1 


Comparison of Means and Standard Deviations for Wall Chart and Ortho-Rater Tests (N = 117) 


                                      Mean                  Standard Deviation 
                   Scoring    Wall    Ortho-    t*        Wall    Ortho-    t* 
Test               Method     Chart   Rater     Ratio     Chart   Rater     Ratio 

Letter (Test)        A        62.6    62.0      1.00      11.0    11.1      0.15 
                     B        64.8    64.2      1.14      11.0    11.5      0.82 
                     C        65.0    63.3      3.07      10.4    11.1      1.46 
                     D        68.9    66.3      4.53      10.8    11.8      1.74 

Letter (Retest)      A        63.3    61.4      3.61      11.5    11.6      0.32 
                     B        65.6    63.7      3.26      12.2    12.0      0.40 
                     C        65.7    63.6      3.94      11.0    10.8      0.32 
                     D        69.6    67.6      3.84      11.5    11.2      0.63 

Landolt (Test)       A        60.9    61.0      0.19      11.7    12.5      1.10 
                     B        62.3    63.0      0.90      12.0    12.8      1.07 
                     C        63.3    63.5      0.26      11.3    12.2      1.35 
                     D        66.8    67.5      0.94      12.0    12.9      1.31 

Landolt (Retest)     A        62.0    62.7      1.12      11.3    13.2      2.96 
                     B        63.8    64.7      1.44      11.7    13.2      2.31 
                     C        64.3    65.1      1.15      11.3    13.7      3.64 
                     D        67.8    69.1      1.7[illeg.]  12.3  14.6     3.32 


*A t ratio of 1.96 indicates that the difference obtained is significant at the 5 per cent level of confidence. 
A t ratio of 2.58 indicates that the difference obtained is significant at the 1 per cent level of confidence. 


Fig. 2. Distribution of scores on the new Army 
Snellen tests. Items 30-39 of the tests are 20/20 
acuity value (N = 117). [Plot of number of cases 
against score interval, 18-26 through 72-80; the 
legible legend entry is Ortho-Rater.] 

The test-retest reliabilities of wall chart 
and Ortho-Rater scores are shown in Table 
2. All Ortho-Rater reliabilities, with one ex- 
ception, are significantly higher than those of 
the wall charts. 
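
The footnote to Table 2 states that the t ratios for these comparisons rest on z-transformations of the reliabilities, an estimated correlation of .40 between the transformed values, and a standard error of .103. One formula that reproduces that standard error for N = 117 is sqrt((2 - 2r)/(N - 3)); it is used below as an assumption. The two reliabilities are the Landolt, scoring method D, values as they appear to read in Table 2 (.65 for the wall chart, .79 for the Ortho-Rater).

```python
from math import atanh, sqrt


def z_difference_ratio(r_first, r_second, n, r_between=0.40):
    """Ratio of the difference between two Fisher z-transformed correlations to its SE.

    r_between is the assumed correlation between the two z values (.40 in the footnote).
    """
    se = sqrt((2 - 2 * r_between) / (n - 3))
    ratio = (atanh(r_second) - atanh(r_first)) / se
    return ratio, se


ratio, se = z_difference_ratio(0.65, 0.79, n=117)
```

With these inputs the standard error rounds to .103 and the ratio comes out close to the 2.91 printed for that comparison; the small gap is what would be expected from working with reliabilities rounded to two decimals.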

The higher reliabilities of the Ortho-Rater 
plates cannot be explained by the fact that 
the Ortho-Rater plates were administered 
after the wall charts. If increased reliability 
is associated with later tests administered in 
the session, the Armed Forces Far Visual 
Acuity test, administered twice, should have 
shown this effect. This test was administered 
as the second and tenth (last) test of the 
photopic series. The test administered in the 
last position shows a significant decrease in 
reliability for two scoring methods and a nonsignificant 
increase for two other methods. 


Fig. 3. Distribution of scores on the modified 
Landolt ring tests. Items 38-45 of the tests are 
20/20 acuity value (N = 117). [Plot of number of 
cases against score interval, 30-38 through 84-92; 
the legible legend entry is Ortho-Rater.] 

The correlations between scores on the wall 
charts and on the Ortho-Rater plates are pre- 
sented in Table 3. These correlations are 
based on scoring method C, which was the 
most reliable method employed (see Table 2). 
They are about as high as the test-retest re- 
liabilities. The mean of the correlations is 
equal to .83; the mean of the reliabilities of 
scoring method C is equal to .85. The mean 
of the correlations, corrected for the attenuating 
effects of unreliability in each variable, is 
.98. These data offer little evidence to support 
the existence of a machine factor or “apparatus 
accommodation factor” specific to 
Ortho-Rater presentation. 


Table 2 


Wall Chart and Ortho-Rater Test-Retest Reliabilities (N = 117) 


                Letter Test                  Landolt Test              Far Visual Acuity Test 
Scoring    Wall    Ortho-     t*        Wall      Ortho-    t*        Second    Tenth    t* 
Method             Rater      Ratio               Rater     Ratio                        Ratio 

A          .81     .90        3.30      [illeg.]  .81       1.94      .90       .83      2.72 
B          .78     [illeg.]   3.69      .69       .79       2.14      .88       .81      2.43 
C          .88     [illeg.]   2.04      .75       .85       2.82      .81       .86      1.55 
D          .80     .87        2.23      .65       .79       2.91      .80       .81      0.29 


* In this context a t ratio of 2.58 indicates that the difference is significant at the 1 per cent level of confidence. 
A t ratio of 1.96 indicates that the difference is significant at the 5 per cent level of confidence. In computing the 
t ratios, the correlation between z-transformations of the reliabilities was estimated at .40 for all comparisons by 
an approximation formula described by McNemar (4, p. 125). A standard error of .103 was obtained from the 
formula $\sqrt{(2 - 2r)/(N - 3)}$ used in obtaining the t ratios. 


Table 3 
Correlations between Wall Chart and Ortho-Rater Tests (Scoring Method C)* (N = 117) 


                                              O-R Test Session    Wall Test Session 
                              Test            and Wall Retest     and O-R Retest      Retest 
Variables                     Session         Session             Session             Session 

Wall chart vs. Ortho-Rater    .85             .87                 .86                 .89 
(Letter)                      (.94)           (.97)               (.96)               (.99) 

Wall chart vs. Ortho-Rater    .78             .84                 .77                 .80 
(Landolt)                     (.98)           (1.00)              (.96)               (1.00) 


* Correlations corrected for attenuation are given in parentheses. 
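
The correction for attenuation referred to here is the standard one: the observed correlation between two tests is divided by the square root of the product of their reliabilities. As a rough check, the sketch below applies it to the two means quoted in the text (.83 for the correlations, .85 for the reliabilities under scoring method C), treating both tests as having the mean reliability. That simplification is an assumption; the paper averaged the individually corrected coefficients.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores, given an observed r and two reliabilities."""
    return r_xy / sqrt(r_xx * r_yy)


# Mean observed correlation and mean reliability quoted in the text.
approx_corrected = correct_for_attenuation(0.83, 0.85, 0.85)
```

The result, about .98, agrees with the reported mean corrected correlation. When an observed correlation exceeds the geometric mean of the two reliabilities, the formula yields a value above 1, which is conventionally reported as 1.00.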


Discussion 


The finding that Ortho-Rater tests are 
more reliable than wall chart tests presents a 
problem for interpretation. The superiority 
of instrument presentation may be due to 
lowered visual distraction with limitation of 
the surround, or to some other advantage of 
subject or stimulus characteristic leading to 
greater constancy of conditions. It is be- 
lieved that the difference in reliability be- 
tween Ortho-Rater and wall chart presenta- 
tion will be even greater in operational test- 
ing than that found here. It is well known 
that the conditions of wall chart testing differ 
widely from place to place. 

If visual angle, background luminance, and 
contrast between object and background are 
equated as closely as possible between Ortho- 
Rater and wall chart presentations, closely 
equivalent measures are obtained. The diffi- 
culties of the tests and the variability of 
scores are similar. When the correlations of 
the tests are taken into consideration, the 
methods appear to measure the same visual 
abilities. 


Summary 


This study presents a comparison of visual 
acuity scores obtained on Ortho-Rater plates 
with visual acuity scores on duplicate wall 
chart tests. A total of 117 subjects were 
tested binocularly and retested two weeks 
later. Letter and modified Landolt ring targets 
were employed. Previous practice had 
been given on other mesopic and photopic 
wall chart and Ortho-Rater plates before the 
tests under consideration were given. 

The following results were obtained: 

1. The two methods of presentation were 
of equal difficulty, except for slight discrep- 
ancies introduced by photographic reduction. 

2. The reliabilities of the Ortho-Rater tests 
were significantly higher than those of the 
wall chart tests. 

3. The correlations between Ortho-Rater 
and wall chart tests were about as high as 
the reliabilities of the tests themselves. When 
corrected for attenuation, these correlations 
approach unity. No evidence is afforded, un- 
der these conditions, of a machine or “appa- 
ratus accommodation” factor affecting Ortho- 
Rater acuity scores. 
Effect of Illumination on Scores with Instrument Acuity Tests 


Newell C. Kephart and Stanley Deutsch 


Occupational Research Center, Purdue University 


Standardized tests for the measurement of 
visual acuity have often been administered 
under conditions which were not held con- 
stant, both within tests and between tests. 
The standardization procedure has been violated 
in many different ways: varying light 
conditions, the uncontrolled mixture of arti- 
ficial and sunlight, angles of view, and the 
like. Self-contained instruments such as the 
Bausch and Lomb Ortho-Rater help to over- 
come these undesirable deviations in practice 
which serve to reduce the validity of the meas- 
urements (1, 2). 

However, it is realized that despite precau- 
tions, occasional variations in the source of 
power in the industrial establishment may 
lead to greater or lesser illumination even 
within the Ortho-Rater than that considered 
to be standard for testing visual skills. This 
experiment was designed to determine whether 
or not such deviations of lighting present seri- 
ous drawbacks to obtaining accurate meas- 
ures of visual acuity. 


Procedure 


A total of 55 college students in a course 
in general psychology were tested on a stand- 
ard Ortho-Rater, using standard practice for 
administration (3). The illumination in the 
instrument was varied by means of a Variac 
rheostat. The only deviations from standard 
testing procedure were these changes in the 
illumination levels which were artificially in- 
duced external to the instrument to simulate 
conditions which might occur in an industrial 
situation. 

Acuity was measured at both the near and 
far point and a total of five levels of illumina- 
tion was used at each distance. Since the 
Ortho-Rater had just been checked and the 
bulbs replaced where necessary, the normal 
level of illumination was used as a base of 
100. Using an illumination level meter to 
determine amount of illumination, three lesser 
amounts of illumination and one greater quan- 


tity were obtained by means of the rheostat. 
This procedure provided illumination for the 
targets in the following percentages of stand- 
ard for far acuity: 12, 56, 75, 100, and 125. 
For near distance the percentages were: 10, 
46, 75, 100, and 125 of the normal illumina- 
tion. 

The stimuli used were the “both eyes” and 
the “right eye” targets. The left eye was oc- 
cluded at all times by a mechanical device 
built into the machine. The targets and levels 
of illumination were presented in a random 
order throughout the experiment, and were 
changed for each subject. Although only the 
right eye saw the material the “both eyes” 
targets were used in addition to the “right 
eye” targets in order to provide additional 
data. 


Results 


The mean acuity scores obtained by the 55 
subjects are shown in Table 1. For the target 
at the optical distance of 26 feet, statistically 
significant differences at the 1% level of con- 
fidence were obtained only for the targets em- 
ploying 10 and 12% of the standard illumina- 
tion. For all of the presentations where the 
illumination was 46% or greater, no statis- 


Table 1 


“t” Ratios Between Acuity Test Scores at Various 
Levels of Illumination and a Standard 
Level of Illumination 


Target 


Level of Both Eyes 


Right Eye 
Illumi- - -— - 


Near Level 
10% 11.21 1% 7 1% 
12% 5.91 1% i 1% 
46% 38 1% ; 1% 
56% NS. N.S. 
75% 35 NS N.S, 
75% . N.S N.S, 

125% N.S. N.S. 

125% 5 N.S 


nation Far Near Level 
tically significant differences were found for 
this group. 

The decrease in illumination for the near 
targets (distance of 13 inches) was somewhat 
more critical. When the lighting was reduced 
as low as 46% or less of standard, differences 
were obtained at the 1% level of confidence. 
No significant differences were found for those 
targets receiving 75% or more illumination. 

When 125% of the prescribed lighting was 
employed, the differences from normal illumi- 
nation were not significant. 


Discussion 


The study was undertaken to determine 
whether significantly different results would 
be obtained if the levels of illumination were 
increased or decreased from that established 
by the manufacturers of the Ortho-Rater. 
Several factors which might potentially in- 
validate the results of instrument acuity tests 
can be hypothesized. These might consist of 
temporary fluctuations in the power supply in 
a factory test room, reduced efficiency of the 
Ortho-Rater light sources prior to examina- 
tion, or any feature which might change the 
illumination level within the instrument. 

Effects of such deviations, however, appear 
to be minimal. A decrease of more than one- 
fourth is required before the differences in 
results become meaningful. 

It is of interest to note that near acuity 
suffers more readily than far acuity. Whether 
this indicates that near visual tasks require 
more illumination than far visual tasks can- 
not be answered from the information pre- 
sented here. 


3. Standard Practice 


The 25% increment in illumination used to 
augment the standard at no time produced a 
significant change in visual acuity scores. 

It is a standard procedure for Ortho-Rater 
operators to check the operational condition 
of their instrument prior to use. As a con- 
sequence, chances of using a bulb operating 
at only 50% of efficiency are slight. Devia- 
tions in the power supply of this magnitude 
are instantly obvious, and testing should 
await the return of the more nearly normal 
level of illumination. This study demon- 
strates that minor changes in illumination do 
not appear to have any real effect upon the 
test results. 


Summary 


1. Decreases in illumination as great as 
one-fourth of standard did not affect scores on 
visual acuity with the Ortho-Rater. 

2. Increases in illumination as great as one- 
fourth of standard did not affect these acuity 
scores. 

3. Near acuity scores suffer to a greater 
degree than far acuity scores when illumina- 
tion is decreased more than 25 per cent. 
Applied Psychology in Action 


Psychological Research in Personnel Administration 


Joseph G. Colmen 


Civilian Personnel Research Branch, Headquarters, USAF 


Industrial management has undoubtedly 
been skeptical about the value of the person- 
nel psychologist as a direct part of its opera- 
tions, payrolled as its job evaluation, training, 
organization and methods, and other func- 
tions are. Yet a large number and variety of 
management problems can be attacked by the 
application of the specialized skills of the re- 
search psychologist. And in most cases not 
only can they provide the most valid solu- 
tions and recommendations but can do this in 
a manner which will please even the most 
practical administrator. To do this, it seems 
important for the research psychologist to be 
close enough to the management and opera- 
tions of the organization so that he can sense 
needs for research in day-to-day problems. 
And he can make acceptable recommenda- 
tions for application of research results in the 
same setting. The possibilities for success 
are greater, of course, where the relationship 
between administrator and psychologist is a 
close and continuing one. 

The Civilian Personnel Research Branch 
(CPRB) of the U.S. Air Force Headquarters 
is in the fortunate position of approximating 
this ideal. This Branch conducts psychologi- 
cal research originating from everyday prob- 
lems of the civilian personnel program of the 
Air Force. 

Many problems had pointed to the need 
for test development when the CPRB was es- 
tablished in 1950. One of the urgent prob- 
lems was to find more objective means for de- 
termining supervisory potential to improve 
the level of supervisory proficiency within the 
Air Force, especially in the important area of 
human relations. Another problem was to 
find an objective measure of potentiality for 
administrative work as a basis for selecting 
junior employees for specialized development 
and training. 

These problems demanded pioneering re- 
search in areas where success had been previ- 
ously only hopefully promising. Fortunately, 
management recognized that scientific re- 
search took time. It did not insist on “quick” 
results at the sacrifice of adequate research. 

As work on these basic problems progressed, 
other needs became evident. It was noted 
that from time to time Air Force installations 
were conducting attitude surveys among ci- 
vilians without the benefit of instruction in 
accepted principles of public opinion polling. 
Guidance was needed in formulating ques- 
tions, sampling employees, conducting sur- 
veys, and analyzing and interpreting results. 
A compact, highly functional guide to the 
conduct of civilian employee attitude surveys 
was prepared to correct these deficiencies to- 
gether with a questionnaire with general ap- 
plicability at all Air Force bases. 

Because test administration would be un- 
dertaken by all Air Force bases upon comple- 
tion of the research, a guide on establishing 
test administration facilities and on the best 
methods of administering and scoring tests 
was developed. Also, regulatory material 
was published to assure effective coordina- 
tion and highest quality of research through- 
out the civilian personnel program and to 
stimulate necessary personnel research where 
resources permitted. 

To conserve research resources, use was 
made wherever possible of the work of other 
organizations, with adaptation and special 
check and validation being accomplished 
within the Air Force. On the other hand, no 
test is authorized for use without specific 
validation on the group for which it is in- 
tended or for a purpose other than that for 
which it was validated. 

With what has been a very heavy schedule, 
new developments in technical areas have still 
been possible. Readers of this paper trained 
in test theory will recognize that depth, 
quality, and originality of research have not 
been sacrificed even in the applied setting in 
which it is conducted. A nomograph for de- 
termining significance of difference between 
percentages for finite populations was devel- 
oped for handy use by statistically untrained 
people in analyzing attitude survey data. An 
“unconventional” key and rationale for it 
was hypothesized and verified as a valid key- 
ing technique for personality test items. The 
methods of weighting tests by precise Wherry- 
Doolittle beta weight methods or Wherry- 
Gaylord integral weighting were found to be 
less appropriate under certain circumstances 
than mere unit weighting of tests so selected. 
Later discussions with Dr. Wherry have con- 
firmed that the number of cases and size of 
test intercorrelations do affect the stability 
of weights derived by these methods. 
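
The nomograph mentioned above is not reproduced in this note. As a rough illustration of the kind of calculation such a chart replaces, the sketch below computes a critical ratio for the difference between two independent percentages, with a finite population correction applied to each sample variance. The choice of formula, the function name, and the survey figures are assumptions made for illustration only; they are not taken from the CPRB procedure.

```python
import math


def z_diff_percentages(p1, n1, N1, p2, n2, N2):
    """Critical ratio for the difference between two independent sample
    proportions, each drawn without replacement from a finite population.

    p = sample proportion, n = sample size, N = population size.
    Assumes n < N for both groups (otherwise the standard error is zero).
    """
    def corrected_variance(p, n, N):
        fpc = (N - n) / (N - 1)          # finite population correction
        return p * (1 - p) / n * fpc

    se = math.sqrt(corrected_variance(p1, n1, N1) +
                   corrected_variance(p2, n2, N2))
    return (p1 - p2) / se


# Hypothetical survey: 60% favorable among 200 of 1,000 employees at one
# base, 50% favorable among 150 of 800 employees at another.
z = z_diff_percentages(0.60, 200, 1000, 0.50, 150, 800)
print(f"critical ratio = {z:.2f}")
```

A ratio beyond roughly 1.96 would conventionally be read as significant at the 5% level; the correction matters most when the sample is a large fraction of the base's work force.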
Awareness of the specialized skills brought 
to the organization by research psychologists 
soon led personnel specialists in other pro- 
gram areas to seek the services of the Ci- 
vilian Personnel Research Branch. Typical 
assignments in other areas of applied research 
were: to determine the advantages and limita- 
tions of functional music in the work setting; 
to develop sampling methods in connection 
with interviews used as part of an evaluation 
of the effectiveness of civilian personnel pro- 
grams at Air Force bases; to determine the 
best sources and methods of ascertaining su- 
pervisory training needs as a basis for de- 
veloping training course content; to deter- 
mine the extent and causes of turnover among 
civilian personnel office professional staffs; to 
develop a battery for selection of fiscal ac- 
counting clerks; to evaluate effectiveness of 
employee suggestion systems; to make recom- 
mendations concerning the most valid and re- 
liable information about personal character- 
istics of candidates for positions from inter- 
views, vouchers and other screening methods; 
with the Civil Service Commission and four 
other federal agencies, to develop selection 
methods for reducing the number of emo- 
tional misfits finding their way into overseas 
jobs; and others. 

Though research has constituted the major 
responsibility of the CPRB, it has never di- 
vorced itself from the personnel administra- 
tion of which it is a part, so that research 
needs are perceived by the psychologist in 
the operating and staff civilian personnel 
problems with which he is in close contact. 
Nor is the work completed when research 
findings are reported. Instead, implementa- 
tion of those findings in the practical setting 
of the operating civilian personnel office be- 
comes in large part also the responsibility of 
the researcher. And he is kept informed of 
and asked to comment on personnel adminis- 
tration programs, policies and procedures 
which are under consideration in the Direc- 
torate of Civilian Personnel. 

The satisfaction of management with the 
accomplishments of its personnel research 
function is seen in continued support and 
growing acceptance. By keynoting economy 
and improvement of operations, data have 
been accumulated showing how the results 
of the work of CPRB have much more than 
offset its modest cost. It is interesting to 
note that whether or not a personnel research 
activity is maintained, management will con- 
duct research studies. Staffing with person- 
nel specifically trained for such work pays 
dividends, if in no other way than in mak- 
ing such research sufficiently sound to assure 
management that the conclusions may be ap- 
plied with confidence. 


Time Limit versus Work Limit Methods of Test Administration 


E. B. Knauft 
Aetna Life Affiliated Companies, Hartford, Connecticut 


The majority of mental alertness tests used 
in the employment situation are speed tests 
in which the time factor plays an important 
role. Those of us working in business and 
industry are sometimes asked whether such 
speed tests unduly penalize the “slow and ac- 
curate individual” who might be a very satis- 
factory worker. 
Some information relative to this question 
was obtained in the process of revising the 
LOMA-1 Test. This is a 15-minute general 
mental ability test of 236 items (including 
number series, same-opposites, analogies, and 
general information) developed by the Life 
Office Management Association for use in 
member companies. Many studies during the 
past 15 years have demonstrated the validity 
of this test as an aid to selection of clerical 
employees in insurance companies. A revised 
form of this test was recently administered to 
employees who took the test with the usual 
15-minute time limit and then were permitted 
to complete the remaining items with time 
noted, but no time limit. 

Data based on 235 employees in four dif- 
ferent life insurance companies showed that 
scores obtained in the 15-minute time limit 
correlate + .88 with scores on the entire test 
obtained under untimed conditions. The 
mean time required to complete all items was 
30.1 minutes. 
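
To make the comparison concrete, the following sketch shows how a product-moment correlation between time-limit and work-limit scores could be computed. The scores are invented for illustration and are not the LOMA data; the function name is likewise hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for eight employees: items correct within the
# 15-minute limit, and items correct after finishing the whole test.
timed = [88, 102, 95, 120, 75, 110, 99, 130]
untimed = [140, 150, 149, 171, 128, 160, 155, 180]

r = pearson_r(timed, untimed)
print(f"r = {r:+.2f}")
```

A high positive value means that rank order in the group is largely preserved when the time limit is removed, which is the sense in which the reported coefficient of +.88 bears on the "slow and accurate" question.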

For this sample, it appears that individu- 
als performing relatively poorly on a mental 
alertness test under timed conditions will not 
appreciably change their standing in the 
group when permitted to complete the test 
with no time limit. The 235 employees rep- 
resent a sample of persons hired within the 
past five years and still employed by the four 
companies. The great majority are between 
the ages of 18 and 35 and are high school 
graduates. 


Employee Opinion Surveys 


“Why should we invite employees to criti- 
cize us? They do enough of that anyway 
without being asked.” 

That’s the attitude of many top manage- 
ment officials. . . .* 

But San Diego Gas and Electric Co. is one 
company that calculated the risk. . . . Now 
it says that it’s glad. One reason SDG&E is 
happy with the results is that the employees 
gave the company a pretty good rating. 

1 See McMurry, R. N. Management’s reactions to 
employee opinion polls. J. appl. Psychol., 1946, 30, 
212-219. (Reference added by Editor.) 


But they received some very specific sug- 
gestions. . . . A total of 3,380 unfavorable 
comments and 1,290 suggestions were made 
by 2,178 employees. . . . The company had 
determined in advance to do something about 
reasonable complaints and it followed up 
quickly. Top management gave full ap- 
proval and support. Adequate assurance of 
anonymity was provided by the Industrial 
Relations Section of California Institute of 
Technology which conducted the survey get- 
ting a 99 per cent return. (Condensed from 
Business Week, November 7, 1953, p. 167.) 
Lawshe, C. H. Psychology of industrial rela- 
tions. New York: McGraw-Hill Book Co., 
1953. Pp. vii + 350. $5.50. 

During the past decade psychology has had 
a rapidly increasing impact upon industrial 
management through two developments: (1) 
the introduction of psychologists into staff, 
and occasionally line, positions; (2) the train- 
ing of various levels of management in the 
principles of human behavior. This has led 
to two types of publication: those for use in 
training industrial psychologists and those 
which present the findings of psychology to 
the non-psychologist. This volume is an ex- 
ample of the latter. 

Seven authors are involved—two from Pur- 
due (Lawshe & E. J. McCormick), one each 
from the Army (A. J. Drucker), Air Force 
(W. F. Long), and Navy (E. E. Dudek), 
and two from industry (K. Oliver & R. I. 
Dawson). The fifteen chapters deal with the 
usual topics: principles of human behavior, 
motivation, attitudes, placement, training, su- 
pervision, employee complaints, counseling, 
efficiency, wage administration, employee and 
employee-management relations. Although 
the chapters were written by individual au- 
thors, they are remarkably similar in style, 
level of reading difficulty, and point of view 
and there appears to be relatively little over- 
lap of content except when desirable. In 
general, the writing is clear and direct, with 
attempts to define psychological terms by 
means of industrial examples. Each chapter 
concludes with a set of references to journal 
articles or psychological texts. 

So much for the over-all structure of the 
book. The question remains—how well has 
the goal of communicating industrial psychol- 
ogy to the non-psychologist been achieved? 
The only true answer to the question can come 
from actually measuring what effect the study 
of this book by non-psychologists has had 
upon their knowledge, skills, and attitudes in 
human relations. Lacking research findings, 
the reviewer can merely speculate as to the 
book’s probable value. 

The problem in writing for the non-psy- 
chologist is, of course, deciding what one 
wants to communicate. There are several 
possibilities: (1) psychological findings with 
or without the underlying evidence; (2) sug- 
gestions for what to do in industrial settings, 
with or without reference to the principles 
being applied; (3) a point-of-view, derived 
from psychological principles of human be- 
havior. 

In this reviewer's opinion, the authors have 
done a creditable job in presenting a large 
body of facts and principles, backed up with 
sufficient references to research literature. 
However, certain areas to which psychologists 
have devoted considerable thinking and re- 
search are inexplicably omitted or merely 
mentioned in passing, viz., industrial safety, 
democracy in management, executive develop- 
ment, employee rating methods, character- 
istics of the learning curve, transfer of 
training. 

It is difficult to evaluate the “how-to-do-it” 
aspect of this book. There appears to be an 
attempt to present principles, not specific 
applications; the discussions tend to include 
more of the “it’s important to take the fol- 
lowing things into account” type of statement 
than to describe how to take them into ac- 
count. There is a somewhat too frequent de- 
pendence upon a brief raising of a question 
or listing of factors and then a reference to a 
bibliographic item for the details, assuming 
that the reader will go to the sources. 

Possibly psychology really cannot give very 
many specific suggestions for industrial prac- 
tices, and its real contribution is in 
methodology and point of view. If so, this 
book serves a useful purpose in getting across 
to the non-psychologist the basic attitudes 
towards human problems which characterize 
psychology, e.g., “emphasis upon the people 
that work rather than upon the product they 
make,” the importance of satisfying basic hu- 
man needs, the need for a “basic respect for 
human beings and a genuineness of purpose in 
dealing with employees.” To the extent that 
publications of this type stimulate operating 
personnel to examine their fundamental atti- 
tudes toward human behavior they will facili- 
tate the acceptance of programs developed by 
the industrial psychologist and raise the level 
of daily interpersonnel relations. 

A. S. Thompson 


Teachers College, 
Columbia University 
McFarland, Ross A. Human factors in air 
transportation. New York: McGraw-Hill, 
1953. Pp. xv + 830. $13.00. 

The advent of the airplane and the exten- 
sion of its performance characteristics have 
subjected those occupying this device to 
an unprecedented variety of environmental 
stresses. Extremes and variations of tem- 
perature and accelerative forces are common- 
place in military flight. The tasks of main- 
tenance and operation have required the de- 
velopment of skills unthought of fifty years 
ago. In many ways, the airplane has pro- 
vided a laboratory and a never-ending set of 
problems for the engineer, the physiologist, 
and the psychologist. 

Drawing upon his unique and extensive ac- 
quaintance with practically every aspect of 
commercial and military aviation, Dr. McFar- 
land has written an encyclopedic volume of 
over 800 two-column pages, illustrated with 
well-selected tabular, graphic, and pictorial 
displays. Each chapter is followed by a 
selected bibliography. 

Human Factors covers thoroughly the areas 
of selection, maintenance of proficiency, and 
safety which are implied by its title as well as 
such topics as sanitation and health in air- 
line operations, the care of passengers, and a 
description of medical programs. The dis- 
cussion of physical factors involving circula- 
tory and sensory phenomena is unusually com- 
prehensive. 

The reviewer is impressed by the clear style, 
careful organization, and excellent typography 
of the book. This work would seem to be not 
only a landmark in the area for which it is 
intended but also a valuable source of in- 
formation for those concerned with most 
branches of personnel psychology. Its chief 
drawbacks are likely to be its size and perhaps 
its price of $13.00, although the latter is cer- 
tainly modest for so large a book aimed at a 
limited audience. 

George K. Bennett 


The Psychological Corporation, 
New York, New York 


Husband, Richard W. The psychology of 
successful selling. New York: Harper and 
Brothers, 1953. Pp. 306. $3.95. 


This book is directed to all salesmen to aid 
them in their daily work. Its emphasis is 
on sales tactics, from finding your prospect 
through approaching him and overcoming his 
resistance to closing the sale. There is also 
a short section concerning the selection of 
salesmen, helping him to compare his traits 
with those of successful salesmen. 

This book is not intended to be a pro- 
fessional book for psychologists; rather it 
is deliberately designed to be easy, informal 
reading without technical language or ref- 
erence to experiments or statistics. It is ad- 
mittedly based upon reading leading books 
by professors of business and sales personnel, 
sales journals, trade publications, newspapers, 
popular magazines, training manuals of cer- 
tain companies, and the author’s personal 
experience. Drawing upon these sources, the 
book presents a series of rules, principles, 
steps, and laws on how to be effective in each 
phase of selling. These are liberally illus- 
trated with clever examples and entertaining 
anecdotes. Sprinkled with this is advice and 
moralizing based on the personal opinion of 
the author. 

Thus while the author claims this is the 
first general book on salesmanship written by 
a professional psychologist, it is certainly free 
from the concepts and language of the psy- 
chologist. You will find no discussion per se 
of motivation, adjustment, individual differ- 
ences, learning, perception, and so forth. These 
were apparently deliberately omitted in order 
to make the book more appealing and read- 
able for salesmen. There are many para- 
graphs, however, in which the oversimplifica- 
tion has led to statements which the reviewer 
could not accept. The book is a challenge to 
psychologists in that it reveals a large area 
in which practical applied research can still 
pioneer. Many statements of the book are 
based on what “should be” and reasoning 
from analogy; it would be quite difficult to 
support them with evidence from scientific 
references. 

In general, there is little in the book to 
recommend it even to sales managers or sales- 
men over the many other volumes written in 
this field. 


Brent Baxter 
The Prudential Ins. Co. of America, 
Newark, New Jersey 
Jahoda, Marie, Deutsch, Morton, and Cook, 
Stuart W. Research methods in social rela- 
tions, with especial reference to prejudice; 
Vol. I: Basic processes; Vol. II: Selected 
techniques. New York: The Dryden Press, 
1951. Pp. x + 421, x + 423-759. $6.00 
(set). 

As indicated in the preface of this pub- 
lication, “This book is in many ways the out- 
come of group effort. The idea of producing 
it arose in a group; it is presented under the 
auspices of a group; its production was 
financed by several groups; it had the edi- 
torial guidance of a group; and it was pro- 
duced by a group.” The sponsoring agency 
for the book was The Society for the Psy- 
chological Study of Social Issues. 

The two volumes themselves show the im- 
pacts of their sponsorship, and of the many 
hands which have been laid upon them. 
There is a sense of urgency in the book’s 
treatment of problems of social intolerance 
and discrimination, and an implication of 
mild, but persistent, exhortation to the reader 
to take constructive steps in combating these 
evils. For the scientific reader the proper 
course is to be found in “action research” 
(participative research directed toward the 
solution of tangible problems), and for the 
social practitioner the recommended course 
is cooperation with the scientific investigator. 

In spite of the instances of apparent over- 
earnestness and occasional naiveness which 
occur in the book, it still remains a useful and 
informative document. The sections on re- 
search planning, on practical issues in re- 
search, and the uses and applications of re- 
search results are admirable. The second 
volume, which consists for the most part of 
separate papers by various contributors, 
should also be noted. Readers desiring short, 
but critical and dependable resumés of topics 
such as scaling concepts, the use of panels, 
and sociometric analyses will find this vol- 
ume a valuable reference. 

The avowed purpose of the book was to 
reach two audiences: the conductors of social 
research and the users of social research. In 
the reviewer’s opinion neither of these specific 
goals has been satisfactorily met, but an- 
other, equally legitimate goal has been. The 
book is too superficial and given over to 
standard illustrations to be of much help to 
the practicing researcher, and seems to be too 
academic and technical to have much appeal 
to the practical man-of-affairs. But the book 
does have a thoroughness, and an informa- 
tive and dependable quality which would 
make it an excellent source book for non- 
specialists, and for students who wish to gain 
a brief, but competent and comprehensive 
overview of this field of research. 

Harrison G. Gough 
University of California, 
Berkeley 


Coombs, C. H. A theory of psychological 
scaling. Ann Arbor: University of Michi- 
gan Press, May, 1952. Pp. vi + 94. $1.75. 

If you’ve had a hard day measuring atti- 
tudes, don’t expect this small monograph to 
provide an evening’s relaxation. It’s packed 
from cover to cover with non-superfluous ma- 
terial. It is to the author’s credit that he has 
said so much in so short a space; nevertheless, 
persons lacking expertness in scaling theory 
will not digest the contents properly. On the 
other hand, scaling theorists will accept this 
tidbit as a juicy morsel and will soon be look- 
ing for more. 

The theory presented here has been in the 
process of formulation for four years and rep- 
resents the contributions and criticisms of 
many scholars. It has undergone continuous 
modification in response to these criticisms 
and will undoubtedly undergo more. How- 
ever, its publication now “is necessary for 
the presentation of certain consequences of 
practical interest to psychologists and social 
scientists” (page v). 

Roughly speaking, the presentation is made 
in four parts: (1) a general discussion of the 
aspects and problems of psychological meas- 
urement; (2) a listing and brief explanation 
of the definitions and postulates on which 
the theory is based; (3) the development and 
interpretation of genotypic and phenotypic 
parameters; and (4) derivations of the con- 
sequences of various genotypic conditions and 
the application of the theory to several sam- 
ple experiments. 

The effort has been to present a mathe- 
matical model which will satisfy observed be- 
havior. As such, the theory must resolve cer- 
tain fundamental issues such as the question 
of defining a psychological trait in a mathe- 
matical sense. In order to resolve problems 
of this sort, parallel systems have been de- 
veloped—one formulated at the genotypic 
level and referring to an individual’s inferred, 
underlying abilities and behavior—the other 
formulated at the phenotypic level and re- 
ferring to an individual’s observed, manifest 
behavior. The ultimate objective of the sys- 
tem, then, is to treat information obtained 
from a set of phenotypic observations so as 
to allow inferences at the genotypic level. 

The first five chapters present the theo- 
retical framework for the realization of this 
objective. It is in Chapter VI that Dr. 
Coombs discusses the area of joint scales, 
the final relating of the phenotypic to the 
genotypic. And it is here that he must con- 
fess partial defeat, for he comes face to face 
with the problem of the direction of the in- 
ferences we, as theorists, wish to make. Thus, 
starting with certain genotypic conditions, it 
is demonstrated what the consequences must 
be in terms of manifest behavior. Unfor- 
tunately, in the practical situation, it is only 
a hope that we may apply these relationships 
in the opposite direction. From character- 
istics of manifest behavior, we desire to infer 
characteristics of behavior at the genotypic 
level. The author sums up the difficulty, “It 
has not been shown that for this given set of 
parameters or characteristics of the manifest 
data it is necessary that these and only these 
conditions must characterize the genotypic 
level” (page 52). 

The author openly states that the theory is 
not in final form. By implication, it is his 
hope that this publication will initiate inter- 
est resulting in a wider range of development 
for the theory in both its abstract and real 
aspects. To this end, the monograph repre- 
sents a good start. 

Marvin D. Dunnette 

The University of Minnesota 


New York Academy of Medicine and the 
Josiah Macy, Jr. Foundation. (Transac- 
tions of the Conference on) Morale—and 
the prevention and control of panic. New 
York: New York Academy of Medicine, 
no date. Pp. 75. No price cited. 

This publication is aimed at inspiring wide- 
spread consideration and study of morale and 
panic. Its audience is not defined but it ap- 
pears to be officials who may have responsi- 
bility for controlling public morale and panic. 

The purpose of the conference was to ex- 
plore available knowledge of the problems 
and discover a way of implementing the 
pooled knowledge through action. Conferees 
were 1 Ph.D., a psychologist in a Veter- 
an’s Hospital; 8 M.D.’s, psychiatrists from 
schools, state and national medical associa- 
tions and governmental agencies; and 2 rep- 
resentatives of public information media, an 
official from a radio broadcasting company 
and the editor of a city newspaper. 

The conferees earnestly advocate study of 
the problems and use of the resulting find- 
ings. Meerloo, formerly chief of the Psycho- 
logical Department of the Netherlands Army, 
contributed most of the specific references to 
evidence on the factors affecting and means 
of controlling panic, drawn from his own 
work during World War II. Herbert Brucker, 
Editor of the Hartford Courant, emphasized 
the belief that factual, play-by-play report- 
ing of the news events as they occur is prob- 
ably the greatest contribution public news 
media can make. He argues against at- 
tempted manipulation of the news and ex- 
hortationary releases from government offi- 
cials as having less than good effect upon 
public morale. 

Much conference time was devoted to re- 
cital of personal experiences and to reference 
to incidents ranging from biblical events to 
postwar reactions of German war leaders. 
Various ways of controlling morale and panic 
with varying degrees of success were cited 
with Meerloo’s practical findings being of 
greatest interest. 

To focus the attention of public officials, 
educators, and research personnel on these 
problems appears desirable. Perhaps a joint 
attack, by experimental study of isolated fac- 
tors and by concurrent multifactor study 
with techniques such as were described dur- 
ing the first meeting of the Operations Re- 
search Society of America, would be produc- 
tive. 

In summary, there was little experimental 
evidence or firm knowledge about causes and 
control of public morale and panic disclosed 
during the conference. It is not clear that 
the aim of inspiring widespread consideration 
and study of morale and panic will be at- 
tained by publication of the transactions of 
the conference. 


Clark L. Hosmer 
United States Air Force 


Traxler, Arthur E., Jacobs, Robert, Selover, 
Margaret, and Townsend, Agatha. Intro- 
duction to testing and the use of test re- 
sults in public schools. New York: Harper 
and Brothers, 1953. Pp. 113. $2.50. 

This book is designed to serve as a “prac- 
tical, down-to-earth handbook for schools be- 
ginning the use of objective tests, for teacher 
discussion groups, for in-service training pro- 
grams, for persons who have had experience 
with tests but who desire to brush up on the 
simpler fundamentals of testing, and for in- 
troductory classes in tests and measurements.” 

It is a revision of Educational Records Bul- 
letin No. 55, Introduction to Testing and the 
Use of Test Results (Educational Records 
Bureau, 1950) which was prepared primarily 
for independent schools. 

A general discussion of the role of objec- 
tive tests is followed by sections on planning 
a testing program, selection, administration, 
and scoring of tests, and analysis, interpreta- 
tion, recording, and use of test results. Ele- 
mentary concepts from test theory and sta- 
tistics are presented in context. Illustrative 
material is utilized extensively; interpreta- 
tions of data for individual students and 
classes, and copies of score reports, cumula- 
tive record forms, and the like make up a 
substantial portion of the book. Through- 
out, the authors give detailed attention to 
the limitations of objective tests and to cau- 
tions which should be exercised in the inter- 
pretation of test results. 

Each chapter includes a list of references 
for readers who wish to go beyond this intro- 
ductory handbook. For such readers, sup- 
plementary information on score interpreta- 
tion is likely to be of special concern; the use 
of test results for descriptive and compara- 
tive purposes is treated more explicitly than 
is their application in predicting future per- 
formance. 

This brief, nontechnical book should be 
distinctly useful to the groups of readers toward
whom it is directed. Despite its title,
the revision seems equally appropriate for
public and independent schools. From the 
standpoint of the former, the more detailed 
discussions of test selection and program 
planning included in the revised edition 
should be of particular interest. 


Marjorie Olsen 
Educational Testing Service, 
Princeton, New Jersey 


Powers, Edwin and Witmer, Helen. An ex- 
periment in the prevention of delinquency: 
The Cambridge-Somerville youth study. 
(With foreword by Gordon W. Allport.) 
New York: Columbia University Press, 
1951. Pp. xliii + 649. $6.00. 

The book is devoted to the description and 
evaluation of a program, or as the authors 
call it, an experiment in the prevention of 
juvenile delinquency. The program had its 
origin in an idea formulated by Dr. Richard 
Clarke Cabot. He believed that the delin- 
quency of boys could be prevented were it 
made possible for them to come under the 
constructive influence of friendly counselors. 

The study was begun in 1935 and termi- 
nated in 1945. In design, the study as ini- 
tially conceived was in the best scientific 
tradition. One group of boys was to be 
given the benefits of counseling while another 
group of boys matched on several variables 
was to remain untreated. Members of the
two groups were selected from lists of names 
provided by various sources. School authori- 
ties nominated boys considered as difficult 
and troublesome as well as boys regarded as 
adjusted. Court records were examined for 
names of potential study subjects. Proba- 
tion officers, police officers, social agencies, 
etc., were asked to submit names. Approxi- 
mately 2,000 names were obtained. All boys 
who had passed their twelfth birthday dur- 
ing the period between referral and investi- 
gation were eliminated. Boys who could not 
be found or who were unavailable were also 
eliminated. The names of those remaining 
after this screening were submitted to three 
experts (not members of the project staff) 
for rating on an eleven-point delinquency 
probability scale. The rating process, which 
took fifteen months to complete, provided 
782 candidates for the experiment. Two psychologists
were then asked to match one boy
with another on such variables as health,
intelligence, personality, home, neighborhood
and delinquency prognosis. The toss of a 
coin determined which of the matched boys 
would be placed in the experimental group. 
Six hundred and fifty boys were selected and 
divided into two groups of 325 each. 

Boys in the experimental group were as- 
signed to the project staff. Staff work with 
such boys was begun in November, 1937, and 
was finished in May, 1939. Two years later 
it was found that case loads of 35 boys 
placed too great a burden on the counselors 
and in order to facilitate more effective work, 
boys believed to be in no need of service 
were dropped. This “retirement” (65 boys) 
plus the elimination of cases through death 
or mobility out of the project area (113 
boys) and the manpower shortage produced 
by the war which necessitated discharging 
boys from care as they reached their seven- 
teenth birthday (72 boys), resulted in sub- 
jecting the members of the group to varying 
periods of treatment. This in brief is the
study organized to test Dr. Cabot’s hypothesis.


The book contains two parts. Powers, the 
author of the first part, describes the project 
and the subjects chosen for treatment. Many 
phases of delinquency and its prevention in 
an urban setting are adequately discussed. 
In this part of the book the reader will also 
find a comprehensive treatment of the many 
problems which arose as the project evolved. 
The second part, written by Witmer, is con- 
cerned with the evaluation of the results of 
the experiment. 

Before indicating the results of the experi- 
ment it seems desirable to describe briefly the 
personnel selected to implement Dr. Cabot’s 
idea. The search for staff was begun in 1935 
and in all over 250 persons were considered 
for the 10 counseling positions. In their
search, the directors of the project were pri- 
marily interested in persons believed to pos- 
sess intelligence, tact, unimpeachable charac- 
ter, and professional experience in dealing 
with people. Those selected were also to 
have faith in the objectives of the project. 
Formal education and training in professional 
social work were not considered a prerequisite 
if the candidate was “a warm, outgoing person
who had that vital spark so essential in
human relationships” (page 92). Women as
well as men were considered, since it was believed
that the former would be particularly
useful in dealing with younger boys. Of the 
ten persons chosen to begin the project four 
were women. A total of 19 different counselors
were employed over the course of the experiment:
15 men and 4 women. Of these
19, two had had experience as boys’ workers, 
two were psychologists, one was a trained 
nurse, eight were professional social workers 
and six others had completed some of the 
academic requirements for a degree in social 
work. 

The counseling staff of the project at- 
tempted to be of service to the boys in many 
ways. In general each counselor was ex- 
pected to learn to know the boys assigned 
to him as completely as possible so as to aid 
the boy to make a more effective adjustment 
to changing life situations. To do this effec- 
tively the counselor needed to be intimately 
acquainted with each boy’s assets and liabili- 
ties. But even more than this the counselors 
helped boys or members of their families to 
find employment, arranged for camp and 
summer placements, advised and counseled 
the boy’s family in respect to his problems, 
procured professional services to remedy the 
boy’s handicaps, taught and encouraged the 
boy to pursue hobbies and wholesome recrea- 
tional activities, etc. Thus, as it may be 
seen, the counseling staff operated in many 
fields and stood ready to aid both the boy 
and his family to meet a variety of needs. 

The results of the experiment indicate that 
little was accomplished. As a matter of fact, 
if we adhere strictly to the data presented, 
the differences in social adjustment of the 
boys in the experimental and control groups 
are insignificant. The services rendered by 
the project appear to be no more effective in 
achieving adjustment than the ordinary events 
in the lives of the boys. It seems apparent 
that delinquency cannot, on the average, be 
prevented by providing the services and coun- 
seling rendered by the project. All of this 
suggests that delinquency and maladjustment 
must be regarded as associated with a va- 
riety of combinations of psychosocial factors 
and any program intended to prevent such
deviations must provide different techniques
to deal with such different combinations of 
factors. 

This is an important book. Much too little 
has been done to put to a rigorous test the 
explicit or implicit assumptions that underlie 
much of what is done in social engineering. 
The experiment reported does this in a fashion 
that renders it an outstanding example of the 
best in social science research. Dr. Cabot’s 
idea failed to produce hoped-for results but
the experiment designed to test the idea is a 
significant contribution to all of the social 
sciences. 

Dr. Allport’s foreword is an exceptionally 
well written preview of the study. 

Elio D. Monachesi
Department of Sociology,
University of Minnesota


Personality: Symposia on topical issues, Vol. 
1, Nos. 3 and 4 (pp. 213-388). New 
York: Grune and Stratton, 1951. 

Of these two numbers, the first contains 11
articles on Hypnosis and Personality and the
second seven articles on Hypnotherapy. G.
W. Williams’ introductory article discusses
some of the unsolved problems of hypnosis;
here, as in many other places in this symposium,
the controversial nature of nearly all
topics in this field is emphasized. Guze
writes on posthypnotic behavior and suggests 
that a standard experimental situation involv- 
ing responses to posthypnotic suggestions 
might be useful as a diagnostic tool, since 
it would show how subjects handle impulses 
not congruent with their usual behavior. 
True and Stephenson present a very impor- 
tant research article on the EEG, pulse, and 
plantar reflex in age regression and induced 
emotional states, in which they confirm the 
recent finding that the Babinski reflex ap- 
pears in subjects who are regressed to in- 
fancy, but they fail to find EEG changes. 
Harriman reports experiments on automatic 
writing, which resulted in very few conclu- 
sions. Loomis contributes a thorough survey 
of experiments from Bramwell to the present 
on space and time distortion in hypnosis. 
LeCron gives the results of an inquiry among
hypnotists, which shows that they are generally
poor subjects. A short but vividly
written article by Estabrooks discusses possible
antisocial uses of hypnosis. Weitzenhoffer
gives a survey of the major investigations
of transcendence of normal voluntary
capacities and concludes that such transcend- 
ence is fairly well established and that sug- 
gestions can cause alterations in nearly all 
organismic activities. As frequently happens 
when contributions are invited, a certain 
amount of recently published material is 
warmed up and served again. Christenson, 
for example, has published also in Psychiatry 
(1949) and in Experimental Hypnosis (edited 
by LeCron) articles on dynamics in hypnotic 
induction in addition to the one appearing 
here. A prospective rather than a retrospec- 
tive look is, however, characteristic of the ar- 
ticle by Kline on psychodiagnostic testing; 
of 15 articles cited eight are “in press.” 

In the number on Hypnotherapy there is a 
very useful introductory article by Schneck, 
which includes brief summaries of some of 
the literature. Watkins’ “Hypnotherapy in 
the military setting” offers little to one who 
has read his book. Rosen’s “Radical hypno- 
therapy of apparent medical and surgical 
emergencies” contains four full case reports 
and is largely new material. Kroger gives a 
very thorough account of personality dynam- 
ics and hypnosis in gynecology based upon 
Kroger and Freed’s book. The articles by 
Raginsky (anesthesiology), Heron (dental 
uses), and Abramson (obstetrical uses) were 
to this reviewer tantalizingly brief and gen- 
eral in treatment, and his scanty information 
in these fields was not much increased by 
them. Fuller articles on these topics with 
more concrete descriptions of the procedures 
would have been welcome. 

In spite of the question of multiple publi- 
cation in a time when nearly all outlets for 
publication are crowded, this symposium is 
very valuable; it presents some new material, 
and its summaries and surveys and full bibli- 
ographies make it very useful for the stu- 
dent, investigator, and practitioner. 


Frank A. Pattie 
University of Kentucky


Books, monographs, and pamphlets for listing and possible review should be sent to Donald G. Paterson,
Editor, Department of Psychology, University of Minnesota, Minneapolis 14, Minnesota


Maternal dependency and schizophrenia. Jo- 
seph Abrahams and Edith Varon. New 
York: International Universities Press, 
1953. Pp. 240. $4.00. 


The design of social research. Russell L. 
Ackoff. Chicago: The University of Chi- 
cago Press, 1953. Pp. 376. $7.50. 

Personality fundamentals for administrators. 
Chris Argyris. New Haven: Labor and 
Management Center, Yale University, 1953. 
Pp. 123. 

Roles and relationships in counseling. Ralph 
H. Berdie, Editor. Minneapolis: Univer- 
sity of Minnesota Press, 1953. Pp. 37. 
$1.25. 

The social theories of Harry Stack Sullivan. 
Dorothy R. Blitsten. New York: The 
William-Frederick Press, 1953. Pp. 186. 
$3.50. 

Design for decision. Irwin D. J. Bross. New
York: The Macmillan Company, 1953.
Pp. 276. $4.25.

Current theory and research in motivation.
Judson S. Brown, et al. Lincoln: University
of Nebraska Press, 1953. Pp. 193.
$2.00.

Professional problems in psychology. Robert
S. Daniel and C. M. Louttit. New York:
Prentice-Hall, Inc., 1953. Pp. 416. $5.50.

Political community at the international level:
problems of definition and measurement.
Karl W. Deutsch. Princeton: Princeton
University Press, 1953. Pp. 71.

Steps in psychotherapy. John Dollard, Frank
Auld, Jr. and Alice Marsden White. New
York: The Macmillan Company, 1953.
Pp. 222. $3.50.

Structure of human personality. H. J. Eysenck.
New York: John Wiley & Sons,
Inc., 1953. Pp. 348. $5.75.

Farnum music notation test. Stephen E.
Farnum. New York: Psychological Corporation.
Pp. 11.

Symposium on fatigue. W. F. Floyd and
A. T. Welford, Editors. London: H. K.
Lewis & Co. Ltd., 1953. Pp. 196. 24s net.

Psychiatry and military manpower policy.
Eli Ginzberg, John L. Herma, and Sol W.
Ginsburg. New York: King’s Crown Press,
1953. Pp. 66. $2.00.

Sample survey methods and theory. Volume 
I. Morris H. Hansen, William N. Hurwitz, 
and William G. Madow. New York: John 
Wiley & Sons, Inc., 1953. Pp. 638. 

Sample survey methods and theory. Volume 
II. Morris H. Hansen, William N. Hur- 
witz, and William G. Madow. New York: 
John Wiley & Sons, Inc., 1953. Pp. 332. 
$7.00. 

How to take a test. Joseph C. Heston. Chicago:
Science Research Associates, 1953.
Pp. 47. $.40.

Developmental psychology. Elizabeth B.
Hurlock. New York: McGraw-Hill Book
Company, 1953. Pp. 556. $6.00.

Psychological reflections. C. G. Jung. New
York: Bollingen Series, 1953. Pp. 342.
$4.50.

A court for children. Alfred J. Kahn. New 
York: Columbia University Press, 1953. 
Pp. 359. $4.50. 

Sexual behavior in the human female. Alfred 
C. Kinsey, Wardell B. Pomeroy, Clyde E. 
Martin, and Paul H. Gebhard. Philadel- 
phia: W. B. Saunders Company, 1953. Pp. 
842. $8.00. 

Hypnotism for professionals. Konradi Leit- 
ner. New York: Stravon Publishers, 1953. 
Pp. 127. $4.00. 

Films in psychiatry, psychology and mental 
health. Adolf Nichtenhauser, Marie L. 
Coleman, and David S. Ruhe. New York: 
Health Education Council, 1953. Pp. 269. 
$6.00. 

Applied imagination. Alex F. Osborn. New
York: Charles Scribner’s Sons, 1953. Pp.
317. $3.75.

New light on dreams. Max Serog. Boston:
The House of Edinboro, Publishers, 1953.
Pp. 159. $3.00.

Group relations at the crossroads. Muzafer
Sherif and M. O. Wilson, Editors. New
York: Harper and Brothers, 1953. Pp.
379. $3.50.

Lawless youth. E. A. Stephens. New York: 
Pageant Press, 1953. Pp. 315. $3.50. 

The study of behavior. William Stephenson. 
Chicago: University of Chicago Press, 1953. 
Pp. 376. $7.50. 

Outline of executive development. Lee Stock- 
ford. Pasadena: California Institute of 
Technology, 1953. Pp. 46. $2.00. 

Living with a disability. Eugene J. Taylor
and Howard A. Rusk. New York: The
Blakiston Company, Inc., 1953.
$3.50.

The work of a counselor. Leona E. Tyler. 
New York: Appleton-Century-Crofts, Inc., 
1953. Pp. 323. $3.00.

Recruiting the college graduate: A guide for 
company interviewers. Richard S. Uhr- 
brock. New York: American Management 
Association, 1953. Pp. 31. $1.25. 

How to help people. Rudolph M. Witten- 
berg. New York: Association Press, 1953. 
Pp. 64. $1.00.


MERIT RATING SERIES. Five standardized merit rating forms: Clerical,
Mechanical, Sales, Technical, Supervisor. Each form uses 60 specific state-


PSYCHOLOGY APPLIED TO LIFE AND WORK, Second Edition (1950)


By HARRY WALKER HEPNER, Professor of Psychology, Syracuse University; 
Consultant in Personnel and Consumer Relations 
A famous text used in 400 colleges, universities and technical schools for courses in Applied
Psychology, Business Psychology, Personnel Management, Executive Training, Psychology
in Industrial Management (orientation for engineers) and General Psychology
where Applied Psychology is stressed.
Applies the adjustment concept to personal and business problems; unifies facts and 
methods with a binding thread of theory. 


Shows how to get along with so a develop emotional maturity, direct self-growth, 
and intelligently supervise employees in industry. 
Features full-length discussion of the adjustment concept, vivid illustrative examples, 
recent findings on group dynamics in industry. 
724 pp. * 6" x 9" * illus. * 1950
Student’s Workbook 80 pp. 8½" x 11" Published 1950
Teacher’s Manual: Free on adoption (restricted) 128 pp., 6" x 9"


MENTAL HYGIENE, 2nd Ed. (1951) The Dynamics of Adjustment 


By HERBERT A. CARROLL, Head of Department of Psychology, University of 
New Hampshire 


A practical text adopted at over 150 schools.
The universal nature of human needs and the conflicts that arise from frustration of
them are discussed to help students acquire flexible habits of adjustment. Examples
are drawn from the author's long teaching and clinical experience with students. Causation
is stressed.


Emphasis is on the role of direct experience and the form such experience takes in 
determining behavior. 

General introductory material is included on motivation, individual differences, learning 
and psychometrics. 


Psychoses are discussed briefly to show the relationship between normal and abnormal 
behavior. 


1951 * 448 pp. * 5144" 2814" 


SOCIAL PSYCHOLOGY 
By SOLOMON E. ASCH, Professor of Psychology, Swarthmore College 


A beautifully written, critically conceived and integrated text which explores basic 
assumptions of social psychology theory with a constructive approach. Takes nothing 
for granted, presenting lucidly the established as well as recent findings about special 
contexts of human perception, emotion, thinking and action. Emphasizes interaction 
as a central concept and provides penetrating insights into both individual and group 
phases of modern life. Called the best evaluation and criticism of various schools of 
psychology of any book in the field. 

1952 * 646 pp. * 6" x 9" * illus.